Pagination in Power Query
- Vojtěch Šíma
- Nov 13, 2024
- 10 min read
tl;dr Pagination is a form of dividing data into smaller chunks (pages). With this method we can transfer data more efficiently. The technique is often used when working with APIs and retrieving larger amounts of data. In Power Query, we can work with paginated sources using simple lists, or we can utilize features such as List.Generate().
The struggle
Imagine ordering a burger at a new fast-food spot. You’re not sure if they have gluten-free (GF) and vegan options, so you ask the waiter for some choices that meet your needs. Enthusiastically, he starts listing all the GF and vegan items, but he doesn't stop there. He keeps going, listing every item on the menu. After five minutes of listening to options you didn’t ask for, you settle on the first option he mentioned (because you can’t remember the rest anyway) and then ask what lemonade options they have. Just as excited, he lists every drink they offer, ignoring your request for specifics...
Now, this is hopefully just an imagined story; still, it would be a little frustrating, right? Just like we wouldn’t enjoy this experience in a restaurant, we don’t want the same with our data. Instead of receiving all the data there is, we can specify only the subset we need and get precisely that.
This approach is known as pagination.
Pagination
Pagination is a technique that splits/divides data into smaller chunks (pages). It's commonly used when handling large datasets and APIs, making data streams easier to work with, speeding up data transfer, and improving scalability.
This article primarily covers pagination in relation to APIs.
Pages
Each page contains a limited number of items. The client can request just a single page, or keep requesting pages until they have enough data, or even the whole dataset. When working with APIs, you can come across different types of pagination.
Types of Pagination
Offset & Limit
This technique revolves around two parameters: offset and limit. The offset determines the starting position in the dataset, and the limit sets the maximum number of items included in a single page.
This varies by API; the parameters can also be named differently, such as startAt & maxResults.
GET /api/posts?offset=30&limit=10
This request would skip the first 30 items and then return the next 10.
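To make the arithmetic concrete, here is a minimal Power Query sketch (with hypothetical numbers) that derives the offset for each page:

```powerquery
let
    pageSize = 10,  // limit: items per page
    total = 45,     // hypothetical dataset size
    // pages needed: ceiling of 45 / 10 = 5
    pageCount = Number.RoundUp(total / pageSize),
    // the offset of each page is its index multiplied by the page size
    offsets = List.Transform({0 .. pageCount - 1}, each _ * pageSize)
in
    offsets // {0, 10, 20, 30, 40}
```

Requesting ?offset=0&limit=10, then ?offset=10&limit=10, and so on, walks through the whole dataset page by page.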
Token-based
When you request the first page, the response includes both the data and a token that points to the next page. You then use this token to continue fetching subsequent pages, looping over each token until you reach the last page (often marked by an isLast field set to true or false). This method ensures that no data is skipped, and you don’t have to calculate offsets yourself.
Request
GET /api/items?limit=5
Response
{"data": ["item1", "item2", "item3", "item4", "item5"], "next_token": "abc123"}
Requesting next page
GET /api/items?limit=5&token=abc123
Response
{"data": ["item6", "item7", "item8", "item9", "item10"], "next_token": "def456"}
Time-based
As the name suggests, this method uses time and date to manage pagination. Depending on the implementation, you can often specify a start_timestamp and end_timestamp to define the exact date range for each data request.
GET /api/data?start_timestamp=2023-01-01T00:00:00&end_timestamp=2023-01-31T23:59:59
Combined pagination
Keep in mind that some APIs may use a combination of pagination methods. It’s always worth reviewing the documentation (if it exists, hah) and example responses to identify which approach is available.
Implementation in Power Query
Finally, we'll walk through the steps to implement these methods in Power Query.
I’ll use real-life examples, so you might find inspiration if you're working on a similar application. I recommend checking Introduction to API in PQ if you've never worked with APIs before.
Disclaimer: This is not a tutorial for the particular APIs; I chose them to demonstrate certain types of pagination, and there may always be a more optimized way. You can still get inspired, though.
Offset & Limit in PQ
Example is JIRA REST API Cloud v3; Get Issues
In this example, we'll check how many items we have in total, select our page size, and then gradually fetch the pages.
Get Total
The first thing we can do is create a default request template as a function, so we don't have to duplicate our steps.
A custom function is a powerful tool for reducing repetitive steps and making processes more versatile and dynamic. Learn more about custom functions here (link coming soon).
request = (offset as number, limit as number) =>
    Json.Document(
        Web.Contents(
            "https://simvojtech.atlassian.net",
            [
                RelativePath = "rest/api/3/search?jql=project=BI",
                Query =
                    [
                        startAt = Text.From(offset),
                        maxResults = Text.From(limit)
                    ],
                Headers =
                    [
                        Authorization = basic_auth_string
                    ]
            ]
        )
    )
basic_auth_string is a text parameter where I store my credentials
JIRA uses different keywords for offset and limit: startAt and maxResults
For this example, our function will have only two parameters: offset and limit. While we could fully parameterize everything, we'll keep it simple by focusing only on the components that need to change.
To get the total, let's make a request with the least amount of data required and utilize the total field in the response.
getTotal = request(0, 0)
Even though this will get us no items, we still get our metadata with the total field

We confirmed we got the total field. Let's modify our step to fetch the total directly.
getTotal = request(0, 0)[total]

Calculate the number of pages
With a clear total number of items, we can decide what maximum number of items per page (limit) we will use and calculate the number of pages needed.
The limit will vary from API to API; in JIRA Cloud the maximum is currently 100 items per page, therefore we can use any number from 1 to 100.
In our example, let's go for 100.
pageSize = 100
Let's do the math.
totalPages = Number.RoundDown( getTotal / pageSize )
We round down because we will start from index 0, therefore we need 3 pages (items 1-100, 101-200, 201-249).
Populating pages
We have a couple of options for populating the pages: either manually or with existing functions such as List.Generate(). Let's do both.
Manual
First, let's get a list of page numbers using a range.
manualPages = {0..totalPages}

The last step is to populate them with actual data.
populatePages = List.Transform( manualPages, each request(_*pageSize, pageSize))

Here, we iterate over every page index (0, 1, 2). For each, we call our function with the HTTP request, inside which we define offset and limit. The offset is the current value (represented by _) multiplied by the page size, for example 0*100, 1*100, 2*100. The limit is always the same, therefore 100.
An individual response would look like this:

Since we know what the response looks like, we can modify the variable again to fetch issues directly (that's where the actual data is stored).
populatePages = List.Transform( manualPages, each request(_*pageSize, pageSize)[issues])

To finish the transformation, we can use List.Combine() to retrieve the individual items from our pages, so we can validate that we got all 249 distinct values.
expandPages = List.Combine(populatePages)

At this point, we can do a standard set of steps to retrieve and validate our response.
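For example, one quick sanity check (a sketch reusing the steps above) is to compare the combined item count against the reported total:

```powerquery
// should evaluate to true when no page was skipped (249 items)
validateCount = List.Count(expandPages) = getTotal
```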
List.Generate()
List.Generate(initial as function, condition as function, next as function, optional selector as nullable function) as list
This method requires a single function call. Here we don't have to create our pages manually; however, the initial logic can be harder to understand at first glance.
populatePagesLG = List.Generate(
    () => 0,
    (page) => page <= totalPages,
    (page) => page + 1,
    (page) => request(page * pageSize, pageSize)
)
What's happening in the function:
() => 0
// define the initial value, the starting point
(page) => page <= totalPages
// as long as the value is less than or equal to totalPages, keep iterating (keep generating new values)
(page) => page + 1
// after each iteration, add 1 to the value
(page) => request(page * pageSize, pageSize)
// at the end of each iteration, run the HTTP request with the current value * pageSize as offset and pageSize as limit
Here I decided to work with the parameter "page" that holds the current value. We can define any name, or we can use the keyword each and then work with an underscore (_) as the current value instead of "page".
If you prefer working with each, it would look like this.
populatePagesLGEach = List.Generate(
    () => 0,
    each _ <= totalPages,
    each _ + 1,
    each request(_ * pageSize, pageSize)
)
each: a keyword that acts as a shortcut for creating an anonymous function with a single parameter. The full representation of each looks like this: (_) =>. The main use of each is to make your code shorter and more readable. You can see each in the steps generated after clicking through the UI.
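To illustrate the equivalence, the following three definitions describe the same function:

```powerquery
let
    addOne1 = each _ + 1,         // the "each" shortcut
    addOne2 = (_) => _ + 1,       // what "each" expands to
    addOne3 = (page) => page + 1  // the same function with a named parameter
in
    {addOne1(4), addOne2(4), addOne3(4)} // {5, 5, 5}
```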
If we want to mimic the direct access to issues, we can modify the step in the same fashion as in the manual approach.
populatePagesLGEach = List.Generate(
    () => 0,
    each _ <= totalPages,
    each _ + 1,
    each request(_ * pageSize, pageSize)[issues]
)
Finally, we can expand our lists using List.Combine() the same way.
expandPagesLGEach = List.Combine( populatePagesLGEach )
Time-based implementation in PQ
Example is Power BI REST API; Get Activity Events
This method is rather straightforward, and we won't use many paging hacks here.
Let's set up a template request. We know that the API operates on two parameters:
startDateTime
endDateTime
The server accepts values as text with single quotes ('2024-11-11T00:01:00.000Z')
request = (start as text, end as text) =>
    Json.Document(
        Web.Contents(
            "https://api.powerbi.com",
            [
                RelativePath = "v1.0/myorg/admin/activityevents",
                Query =
                    [
                        startDateTime = "'" & start & "'",
                        endDateTime = "'" & end & "'"
                    ],
                Headers =
                    [
                        Authorization = "Bearer " & temp_token
                    ]
            ]
        )
    )
This is our skeleton: we initialized two parameters, start and end, which feed the query parameters startDateTime and endDateTime. Notice that since the server requires single quotes, we wrap the values directly in the definition.
temp_token is a generated temporary bearer token stored as a text parameter
getFilteredData = request("2024-11-11T00:01:00.000Z", "2024-11-11T23:59:59.999Z")
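If you need a longer period than a single request covers, you can generate one request per day (a sketch reusing the request function above; the date range is hypothetical):

```powerquery
// build a list of dates from Nov 1 to Nov 7, 2024
days = List.Transform(
    {Number.From(#date(2024, 11, 1)) .. Number.From(#date(2024, 11, 7))},
    Date.From
),
// one call per day, covering the whole day in each request
dailyPages = List.Transform(
    days,
    each request(
        Date.ToText(_, "yyyy-MM-dd") & "T00:00:00.000Z",
        Date.ToText(_, "yyyy-MM-dd") & "T23:59:59.999Z"
    )
)
```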
Now, this would be it for many APIs. Fortunately, in our case, we have the opportunity to introduce token-based pagination, because the response looks like this:

Token-based implementation in PQ
Example is Power BI REST API; Get Activity Events; follow up from Time-based pagination
As the name implies, each response includes a token that we’ll use to fetch the next page. In our example, we can either use the continuationToken directly or go with the continuationUri. The latter requires modifying the address to work correctly in Power Query; however, it's recommended to use the URI provided by the API due to possible DNS issues.
"DNS issues" means that the host in the suggested next page's URL might differ from the host used in your original request. This discrepancy requires you to translate the address to the preferred host designated by the API developers. Historically, this translation could be faulty for our example endpoint.
To access the next page, we'll use the provided URI. We'll split it into two parts: the new host (which we must unfortunately write statically due to concerns with dynamic sources) and the relative path (this part can and should remain dynamic, as it inherently needs to be). Additionally, we'll remove the timestamp parameters since they are no longer required.
We can construct a new template request using this approach for cleaner and more maintainable code.
nextPageRequest = (uri as text) =>
    Json.Document(
        Web.Contents(
            "https://wabi-west-europe-e-primary-redirect.analysis.windows.net",
            [
                RelativePath = Text.AfterDelimiter(uri, "/", 2),
                Headers =
                    [
                        Authorization = "Bearer " & temp_token
                    ]
            ]
        )
    )
Text.AfterDelimiter() gives us the second part of the URI: the text after the third "/" symbol (the occurrence index starts from 0, therefore 2 means the third occurrence).
removed part: https://wabi-west-europe-e-primary-redirect.analysis.windows.net/
remaining part: v1.0/myorg/admin/activityevents?continuationToken='eyJTdGFydER...'
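You can verify the split with a standalone expression (the host and token here are made up):

```powerquery
let
    uri = "https://example.analysis.windows.net/v1.0/myorg/admin/activityevents?continuationToken='abc'",
    // slashes #0 and #1 belong to "https://", slash #2 follows the host
    relativePath = Text.AfterDelimiter(uri, "/", 2)
in
    relativePath // "v1.0/myorg/admin/activityevents?continuationToken='abc'"
```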
Now we have all the essentials, and we're ready to start making our calls.
token_based_call =
    List.Generate(
        () => request("2024-11-11T00:01:00.000Z", "2024-11-11T23:59:59.999Z"),
        each not _[lastResultSet],
        each nextPageRequest(_[continuationUri]),
        each _[activityEventEntities]
    )
Disclaimer for the "not reading, fast copying" visitors: this is not yet the final solution.
As you can see, we're once again using our reliable List.Generate() function. Let’s break down what’s happening within it.
token_based_call =
    List.Generate(
        () => initial call using our "request" function with timestamps,
        loop until the value that indicates the last page is TRUE,
        using the continuation URI, call the function "nextPageRequest",
        for each iteration, save the data from the field "activityEventEntities"
    )
This works perfectly—until it doesn’t. We expected 3 pages for this particular request, but we’re only getting 2.

The issue is that we’re not retrieving the data from the last page. Since lastResultSet is set to TRUE, the function stops prematurely.
There are several ways to fix this. You could use functions like List.Accumulate() or try writing a recursive function. To keep things straightforward, though, let me introduce a cheeky workaround.
token_based_call =
    List.Generate(
        () =>
            [
                call = request("2024-11-11T00:01:00.000Z", "2024-11-11T23:59:59.999Z"),
                nextUri = call[continuationUri]?,
                isLastPage = call[lastResultSet]?
            ],
        each not [isLastPage],
        each
            [
                call = nextPageRequest(nextUri),
                nextUri = _[call][continuationUri]?,
                isLastPage = _[call][lastResultSet]?
            ],
        each [call][activityEventEntities]?
    )
This method is inspired by @Štěpán Rešl's library (link).
Now, this you can copy ("not reading, fast copying" visitors)
In this workaround, there aren't as many changes as it might initially seem. Essentially, for each iteration we create our own record in which we control both the data and the iteration condition. This approach lets us "lag" one iteration behind, which allows us to load and save the data for the last page.
Here's a little breakdown of the code:
token_based_call_commented =
    List.Generate(
        // The initial value: constructing a record
        () =>
            [
                // Initial API call using the timestamp parameters.
                call = request("2024-11-11T00:01:00.000Z", "2024-11-11T23:59:59.999Z"),
                // Extract the "continuationUri" from the call.
                nextUri = call[continuationUri]?,
                // Extract the "lastResultSet" from the call.
                isLastPage = call[lastResultSet]?
            ],
        // Condition: keep iterating as long as "isLastPage" is not true.
        // The value comes from the established record.
        each not [isLastPage],
        // Transform: define how the state evolves for the next iteration.
        // We use the same field names and update them.
        each
            [
                // Call the next page request using the URI from the previous step.
                call = nextPageRequest(nextUri),
                // Update the "continuationUri" from the previous response.
                nextUri = _[call][continuationUri]?,
                // Update the "lastResultSet" from the previous response.
                isLastPage = _[call][lastResultSet]?
            ],
        // Selector: extract the required data ("activityEventEntities") from each response.
        each [call][activityEventEntities]?
    )
If we were to display this approach in a table, it could look like this:

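As an aside, the recursive alternative mentioned earlier could look roughly like this (a sketch reusing the request and nextPageRequest functions from above; the @ prefix lets a let-bound function call itself):

```powerquery
// Collect every page, including the last one, by recursing until lastResultSet is true.
getAllPages = (resp as record) as list =>
    {resp[activityEventEntities]?}
    & (
        if resp[lastResultSet]? = true then
            {}
        else
            @getAllPages(nextPageRequest(resp[continuationUri]?))
    ),
token_based_recursive =
    List.Combine(
        getAllPages(request("2024-11-11T00:01:00.000Z", "2024-11-11T23:59:59.999Z"))
    )
```

Keep in mind that deep recursion can get slow in Power Query; for a handful of pages it works fine.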
Summary
In today’s in-depth guide, we covered several essential pagination methods—offset-limit, time-based, and token-based pagination. These methods help manage large datasets effectively, especially when working with APIs. We explored how to implement each type in Power Query, using custom functions and tools like List.Generate() to create flexible and efficient pagination systems.
From building pages manually to using dynamic functions, you now have a foundation to design custom pagination strategies in Power Query. Along the way, we also touched on record field manipulation and troubleshooting logic errors that can arise when working with these techniques.
Thank you for reading and go treat yourself to a real (non-imagined) burger!