are children of <html> and in turn they are siblings. A nice family!\n\nThere are a few tags there:\n\n<head> contains metadata for the webpage, for example the page title\n\n<title> is the title of the page\n\n<body> defines the body of the page\n\n<p> is a paragraph\n\n<div> is a division or area of the page\n\n<b> indicates bold font-weight\n\n<a> is a hyperlink, and in the example above it contains two attributes: href, which indicates the link's destination, and an identifier called id.\n\nOK, let us now try to parse this page.\n\nBeautiful Soup\n\nIf we save the content of the HTML document described above and open it in a browser we will see something like this:\n\nThis is one paragraph.\n\n\n\nThis is another paragraph. HTML is cool!\n\n\n\nDomino Datalab Blog\n\nHowever, we are interested in extracting this information for further use. We could manually copy and paste the data, but fortunately we don't need to - we have Beautiful Soup to help us.\n\nBeautiful Soup is a Python module that is able to make sense of the tags inside HTML and XML documents. You can take a look at the module's page here.\n\nLet us create a string with the content of our HTML. We will see how to read content from a live webpage later on.\n\nWe can now import Beautiful Soup and read the string as follows:\n\nLet us look into the content of html_soup, and as you can see it looks boringly normal:\n\nBut, there is more to it than you may think. Look at the type of the html_soup variable, and as you can imagine, it is no longer a string. Instead, it is a BeautifulSoup object:\n\nAs we mentioned before, Beautiful Soup helps us make sense of the tags in our HTML file. It parses the document and locates the relevant tags. We can for instance directly ask for the title of the website:\n\nOr for the text inside the title tag:\n\nSimilarly, we can look at the children of the body tag:\n\nFrom here, we can select the content of the first paragraph. 
From the list above we can see that it is the second element in the list. Remember that Python counts from 0 , so we are interested in element number 1:\n\nThis works fine, but Beautiful Soup can help us even more. We can for instance find the first paragraph by referring to the p tag as follows:\n\nWe can also look for all the paragraph instances:\n\nLet us obtain the hyperlink referred to in our example HTML. We can do this by requesting all the a tags that contain an href:\n\nIn this case the contents of the list links are tags themselves. Our list contains a single element and we can see its type:\n\nWe can therefore request the attributes href and id as follows:\n\nReading the source code of a webpage\n\nWe are now ready to start looking into requesting information from an actual webpage. We can do this with the help of the Requests module. Let us read the content of a previous blog post, for instance, the one on \"Data Exploration with Pandas Profiler and D-Tale\"\n\nA successful request of the page will return a response 200 :\n\nThe content of the page can be seen with my_page.content. I will not show this as it will be a messy entry for this post, but you can go ahead and try it in your environment.\n\nWhat we really want is to pass this information to Beautiful Soup so we can make sense of the tags in the document:\n\nLet us look at the heading tag h1 that contains the heading of the page:\n\nWe can see that it has a few attributes and what we really want is the text inside the tag:\n\nThe author of the blog post is identified in the div of class author-link, let's take a look:\n\nNote that we need to refer to class_ (with an underscore) to avoid clashes with the Python reserved word class. As we can see from the result above, the div has a hyperlink and the name of the author is in the text of that tag:\n\nAs you can see, we need to get well acquainted with the content of the source code of our page. 
You can use the tools that your favourite browser gives you to inspect elements of a website.\n\nLet's say that we are now interested in getting the list of goals given in the blog post. The information is in a <ul> tag, which is an unordered list, and each entry is in a <li> tag, which is a list item. The unordered list has no class or role (unlike other lists in the page):\n\nOK, we can now extract the entries for the HTML list and put them in a Python list:\n\nAs mentioned before, we could be interested in getting the text of the blog post to carry out some natural language processing. We can do that in one go with the get_text() method.\n\nWe can now use some of the techniques described in the earlier post on natural language with spaCy. In this case we are showing each entry, its part-of-speech (POS), the explanation for POS and whether the entry is considered a stop word or not. For convenience we are only showing the first 10 entries.\n\nReading table data\n\nFinally, let us use what we have learned so far to get data that can be shown in the form of a table. We mentioned at the beginning of this post that we may want to see the number of gold medals obtained by different countries in the Olympic Games in Tokyo. We can read that information from the relevant entry in Wikipedia.\n\nUsing the inspect element functionality of my browser I can see that the table where the data is located has a class. See the screenshot from my browser:\n\nIn this case, we need to iterate through each row (tr) and then assign each of its elements (td) to a variable and append it to a list. One exception is the heading of the table, which has th elements.\n\nLet us find all the rows. 
We will single out the first one of those to extract the headers, and we'll store the medal information in a variable called allRows:\n\nLet us look at the first row:\n\nAs you can see, we need to find all the th tags and get the text, and furthermore we will get rid of leading and trailing spaces with the strip() method. We do all this within a list comprehension:\n\nCool! We now turn our attention to the medals:\n\nHang on... that looks great but it does not have the names of the countries. Let us see the content of allRows:\n\nAha! The name of the country is in a th tag, and actually we can extract it from the string inside the hyperlink:\n\nYou can see from the table in the website that some countries have the same number of gold, silver and bronze medals and thus are given the same rank. See for instance rank 36 given to both Greece and Uganda. This has some implications for our data scraping strategy. Let us look at the results for entries 35 to 44:\n\nOur rows have five entries, but those that have the same ranking actually have four entries. These entries have a rowspan attribute as shown in the screenshot for rank 36 below:\n\nLet us find the entries that have a rowspan attribute and count the number of countries that share the same rank. We will keep track of the entry number, the td number, the number of countries that share the same rank and rank assigned:\n\nWe can now fix our results by inserting the correct rank in the rows that have missing values:\n\nLet us check that this worked:\n\nWe can now insert the names of the countries too:\n\nFinally, we can use our data to create a Pandas dataframe:\n\nSummary\n\nWe have seen how to parse an HTML document and make sense of the tags within it with the help of Beautiful Soup. You may want to use some of the things you have learned here to get some data that otherwise may be only available in a webpage. Please remember that you should be mindful of the rights of the material you are obtaining. 
Read the terms and conditions of the pages you are interested in, and if in doubt it is better to err on the side of caution. One last word - web scraping depends on the given structure of the webpages you are parsing. If the pages change, it is quite likely that your code will fail. In this case, be ready to roll up your sleeves, re-inspect the HTML tags, and fix your code accordingly.","wordCount":2064,"timeRequired":"PT13M"}</script><script> (function (w, d, s, l, i) { w[l] = w[l] || []; w[l].push({ "gtm.start": new Date().getTime(), event: "gtm.js" }); var f = d.getElementsByTagName(s)[0], j = d.createElement(s), dl = l != "dataLayer" ? "&l=" + l : ""; j.async = true; j.src = "https://www.googletagmanager.com/gtm.js?id=" + i + dl; f.parentNode.insertBefore(j, f); })(window, document, "script", "dataLayer", "GTM-N4Q7ZX4" ); </script><link rel="preconnect" href="https://api.algolia.com"/><link rel="dns-prefetch" href="https://api.algolia.com"/><link rel="prefetch" href="/blog"/><link rel="prefetch" href="/resources"/><link rel="prefetch" href="/customers"/><link rel="prefetch" href="/solutions"/><link rel="prefetch" href="https://domino.ai/platform"/><link rel="prefetch" href="/resources"/><script type="text/javascript" src="//js.hsforms.net/forms/embed/v2.js" defer=""></script><script src="https://js.hscta.net/cta/current.js" defer=""></script><link rel="preconnect" href="https://dev.visualwebsiteoptimizer.com"/><script type="text/javascript" id="vwoCode"> window._vwo_code || (function() { var account_id=1102922, version=2.1, settings_tolerance=2000, hide_element='body', hide_element_style = 'opacity:0 !important;filter:alpha(opacity=0) !important;background:none !important;transition:none !important;', /* DO NOT EDIT BELOW THIS LINE */ f=false,w=window,d=document,v=d.querySelector('#vwoCode'),cK='_vwo_'+account_id+'_settings',cc={};try{var c=JSON.parse(localStorage.getItem('_vwo_'+account_id+'_config'));cc=c&&typeof c=='object'?c:{}}catch(e){}var 
stT=cc.stT==='session'?w.sessionStorage:w.localStorage;code={nonce:v&&v.nonce,library_tolerance:function(){return typeof library_tolerance!=='undefined'?library_tolerance:undefined},settings_tolerance:function(){return cc.sT||settings_tolerance},hide_element_style:function(){return'{'+(cc.hES||hide_element_style)+'}'},hide_element:function(){if(performance.getEntriesByName('first-contentful-paint')[0]){return''}return typeof cc.hE=='string'?cc.hE:hide_element},getVersion:function(){return version},finish:function(e){if(!f){f=true;var t=d.getElementById('_vis_opt_path_hides');if(t)t.parentNode.removeChild(t);if(e)(new Image).src='https://dev.visualwebsiteoptimizer.com/ee.gif?a='+account_id+e}},finished:function(){return f},addScript:function(e){var t=d.createElement('script');t.type='text/javascript';if(e.src){t.src=e.src}else{t.text=e.text}v&&t.setAttribute('nonce',v.nonce);d.getElementsByTagName('head')[0].appendChild(t)},load:function(e,t){var n=this.getSettings(),i=d.createElement('script'),r=this;t=t||{};if(n){i.textContent=n;d.getElementsByTagName('head')[0].appendChild(i);if(!w.VWO||VWO.caE){stT.removeItem(cK);r.load(e)}}else{var o=new XMLHttpRequest;o.open('GET',e,true);o.withCredentials=!t.dSC;o.responseType=t.responseType||'text';o.onload=function(){if(t.onloadCb){return t.onloadCb(o,e)}if(o.status===200||o.status===304){_vwo_code.addScript({text:o.responseText})}else{_vwo_code.finish('&e=loading_failure:'+e)}};o.onerror=function(){if(t.onerrorCb){return t.onerrorCb(e)}_vwo_code.finish('&e=loading_failure:'+e)};o.send()}},getSettings:function(){try{var e=stT.getItem(cK);if(!e){return}e=JSON.parse(e);if(Date.now()>e.e){stT.removeItem(cK);return}return e.s}catch(e){return}},init:function(){if(d.URL.indexOf('__vwo_disable__')>-1)return;var e=this.settings_tolerance();w._vwo_settings_timer=setTimeout(function(){_vwo_code.finish();stT.removeItem(cK)},e);var t;if(this.hide_element()!=='body'){t=d.createElement('style');var 
n=this.hide_element(),i=n?n+this.hide_element_style():'',r=d.getElementsByTagName('head')[0];t.setAttribute('id','_vis_opt_path_hides');v&&t.setAttribute('nonce',v.nonce);t.setAttribute('type','text/css');if(t.styleSheet)t.styleSheet.cssText=i;else t.appendChild(d.createTextNode(i));r.appendChild(t)}else{t=d.getElementsByTagName('head')[0];var i=d.createElement('div');i.style.cssText='z-index: 2147483647 !important;position: fixed !important;left: 0 !important;top: 0 !important;width: 100% !important;height: 100% !important;background: white !important;display: block !important;';i.setAttribute('id','_vis_opt_path_hides');i.classList.add('_vis_hide_layer');t.parentNode.insertBefore(i,t.nextSibling)}var o=window._vis_opt_url||d.URL,s='https://dev.visualwebsiteoptimizer.com/j.php?a='+account_id+'&u='+encodeURIComponent(o)+'&vn='+version;if(w.location.search.indexOf('_vwo_xhr')!==-1){this.addScript({src:s})}else{this.load(s+'&x=true')}}};w._vwo_code=code;code.init();})(); </script><!--end head--></head><body><div id="root"><div class="font-sans"><nav aria-label="Header navigation" class="fixed scrollbar-hide overflow-auto xl:overflow-visible xl:flex border-b border-[#5e6d87] items-center w-full shrink-0 z-[1000] bg-black"><div class="container mx-auto xl:flex"><div class="flex sm:flex-row items-center justify-between shrink-0 w-full xl:w-auto h-17.5 xl:h-20 px-3.75 xl:px-0 border-b border-dark-border xl:border-none"><a href="https://domino.ai/" class="flex" aria-label="Domino Data Lab"><img src="https://dominodatalab-git-main-domino-data-lab.vercel.app/images/logo.svg" alt="Domino Data Lab" width="120" height="23" class=" "/></a><div class="whitespace-nowrap flex-shrink-0 text-white sm:flex xl:hidden ml-auto"><a href="https://domino.ai/contactus" to="/contactus" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors bg-[#57545F] hover:bg-gray-900 !border-0 active:bg-gray-800 text-white mainDarkGray h-12.5 
md:!h-12.5 xl:!h-10 px-7 py-2.5 mx-2.5 xl:mx-0 w-full xl:w-auto flex rounded-full" tabindex="0"><span class="text-xs md:text-xs lg:text-sm text-button whitespace-nowrap">Contact us</span></a><a href="https://domino.ai/demo" to="/demo" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors text-dark-background bg-accent-2 hover:bg-accent-2-hover active:bg-accent-2-pressed mainOrange h-12.5 md:!h-12.5 xl:!h-10 px-7 py-2.5 w-full xl:w-auto flex rounded-full" tabindex="0"><span class="text-xs md:text-xs lg:text-sm text-button whitespace-nowrap">Watch Demo</span></a></div></div><div class="hidden xl:flex flex-col xl:flex-row justify-between flex-grow items-center overflow-hidden"><ul class="h-full scrollbar-hide flex flex-col xl:flex-row overflow-scroll w-full ml-auto lg:ml-12"></ul><div class="flex flex-col md:flex-row md:hidden xl:flex items-center gap-y-3 md:gap-y-0 gap-x-2 mt-5 xl:mt-0 w-full xl:w-auto px-5 xl:px-0"><a href="https://domino.ai/contactus" to="/contactus" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors bg-[#57545F] hover:bg-gray-900 !border-0 active:bg-gray-800 text-white mainDarkGray h-12.5 md:!h-12.5 xl:!h-10 px-7 py-2.5 mx-2.5 xl:mx-0 w-full xl:w-auto flex rounded-full" tabindex="0"><span class="text-xs md:text-xs lg:text-sm text-button whitespace-nowrap">Contact us</span></a><a href="https://domino.ai/demo" to="/demo" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors text-dark-background bg-accent-2 hover:bg-accent-2-hover active:bg-accent-2-pressed mainOrange h-12.5 md:!h-12.5 xl:!h-10 px-7 py-2.5 w-full xl:w-auto flex rounded-full" tabindex="0"><span class="text-xs md:text-xs lg:text-sm text-button whitespace-nowrap">Watch Demo</span></a></div></div></div></nav><div class="main-wrapper pt-17.5 xl:pt-20"><main><section class="relative bg-white py-10 lg:py-12.5 
px-3.75 md:px-10 lg:px-25 pb-0 md:pb-0 lg:pb-0"><div class="mx-auto w-full max-w-[1300px] text-left"><div class="lg:max-w-[80%]"><div class="flex flex-wrap items-center gap-x-4 gap-y-2 mb-8"><span class="inline-flex items-center rounded-full border border-gray-200 bg-gray-50 px-2.5 py-1 text-xs font-medium uppercase tracking-wide text-gray-900">Data Science</span><span class="inline-flex items-center rounded-full border border-gray-200 bg-gray-50 px-2.5 py-1 text-xs font-medium uppercase tracking-wide text-gray-900">Code</span><span class="inline-flex items-center rounded-full border border-gray-200 bg-gray-50 px-2.5 py-1 text-xs font-medium uppercase tracking-wide text-gray-900">Data Engineering</span></div><div class="flex items-center gap-2 mb-4"><span class="text-sm text-gray-500">September 30, 2021 | 13 min read</span></div><div><h1 class="font-bold text-dark-background text-3xl md:text-4xl lg:text-5xl leading-tight balance-text mb-6 lg:mb-16">Getting Data with Beautiful Soup</h1></div><div class="flex flex-wrap items-center justify-between gap-6"><div class="flex items-center gap-4"><img loading="lazy" decoding="async" alt="Dr J Rogel-Salazar" class="w-12 h-12 shrink-0 rounded-full object-cover" style="--img-aspect-ratio:1;--img-natural-width:400px" src="https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=80&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=80&fit=max&auto=format 80w, https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=160&fit=max&auto=format 160w, https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=240&fit=max&auto=format 240w" sizes="(max-width: 80px) 100vw, 80px" width="400" height="400" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/><div><a 
href="https://domino.ai/blog/author/jrogel" to="/blog/author/jrogel" class="overlay-link font-semibold text-dark-background text-base" tabindex="0">Dr J Rogel-Salazar</a></div></div><div class="flex items-center gap-3"><a href="https://www.facebook.com/share.php?u=https://domino.ai/blog/getting-data-with-beautiful-soup" target="_blank" rel="noreferrer" aria-label="Facebook" class="group flex items-center justify-center w-10 h-10 rounded-full text-gray-700 transition-colors hover:bg-accent-2 hover:text-accent-2 [&_img]:transition-[filter] group-hover:[&_img]:brightness-0 group-hover:[&_img]:invert"><img loading="lazy" decoding="async" alt="" style="--img-aspect-ratio:1;--img-natural-width:29px" src="https://cdn.sanity.io/images/kuana2sp/production-main/3990faa9df52a6ca86fd82d22d91a31864d91991-29x29.svg?w=20&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/3990faa9df52a6ca86fd82d22d91a31864d91991-29x29.svg?w=20&fit=max&auto=format 20w" sizes="(max-width: 20px) 100vw, 20px" width="29" height="29" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/></a><a href="https://twitter.com/intent/tweet?url=https://domino.ai/blog/getting-data-with-beautiful-soup" target="_blank" rel="noreferrer" aria-label="Twitter" class="group flex items-center justify-center w-10 h-10 rounded-full text-gray-700 transition-colors hover:bg-accent-2 hover:text-accent-2 [&_img]:transition-[filter] group-hover:[&_img]:brightness-0 group-hover:[&_img]:invert"><img loading="lazy" decoding="async" alt="" style="--img-aspect-ratio:1;--img-natural-width:29px" src="https://cdn.sanity.io/images/kuana2sp/production-main/7fdd5d0b59f965208ec834db1e73e11a7e884f98-29x29.svg?w=20&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/7fdd5d0b59f965208ec834db1e73e11a7e884f98-29x29.svg?w=20&fit=max&auto=format 20w" sizes="(max-width: 20px) 100vw, 20px" width="29" height="29" data-loaded="false" data-above-fold="false" rel="" 
fetchpriority="low"/></a><a href="http://www.linkedin.com/shareArticle?mini=true&url=https://domino.ai/blog/getting-data-with-beautiful-soup" target="_blank" rel="noreferrer" aria-label="LinkedIn" class="group flex items-center justify-center w-10 h-10 rounded-full text-gray-700 transition-colors hover:bg-accent-2 hover:text-accent-2 [&_img]:transition-[filter] group-hover:[&_img]:brightness-0 group-hover:[&_img]:invert"><img loading="lazy" decoding="async" alt="" style="--img-aspect-ratio:1;--img-natural-width:29px" src="https://cdn.sanity.io/images/kuana2sp/production-main/fbb9fe5b4586b8344f2ef2dd52f5b796919494a7-29x29.svg?w=20&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/fbb9fe5b4586b8344f2ef2dd52f5b796919494a7-29x29.svg?w=20&fit=max&auto=format 20w" sizes="(max-width: 20px) 100vw, 20px" width="29" height="29" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/></a></div></div><div class="h-[3px] w-full shrink-0 mt-8" style="background:linear-gradient(to right, #ff9421 0%, #f8475e 34%, #f020b3 66%, #8636f8 100%)"></div></div></div></section><section class="mx-auto py-10 lg:py-12.5 bg-white px-3.75 md:px-10 lg:px-25"><div class="mx-auto max-w-[1300px]"><div class="lg:flex lg:justify-between lg:items-start"><div class="lg:w-[60%]"><a class="overlay-link uppercase text-xs font-mono text-dark-background hover:text-gray-500 active:text-gray-600" href="/blog" style="margin-bottom:2rem;display:inline-block;position:relative;left:-10px"><svg style="margin-right:.8rem;display:inline-block;rotate:180deg" width="14" height="10" fill="none" xmlns="http://www.w3.org/2000/svg" class="h-2.5 ml-2.5 stroke-accent-2 group-hover:stroke-accent-hover group-active:stroke-dark-background" stroke="#FF6543" viewBox="0 0 14 10"><path d="M8.97.97 13 5m0 0L8.97 9.03M13 5H0"></path></svg>Return to blog home</a><div><p id="body__f3dfb12a9a5f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Data is all around us, from the 
spreadsheets we analyse on a daily basis, to the weather forecast we rely on every morning or the webpages we read. In many cases, the data we consume is simply given to us, and a simple glance is enough to make a decision. For example, knowing that the chance of rain today is 75% all day makes me take my umbrella with me. In many other cases, the data provided is so rich that we need to roll up our sleeves and we may use some exploratory analysis to get our heads around it. We have talked about some useful packages to do this exploration in <a href="https://domino.ai/blog/data-exploration-with-pandas-profiler-and-d-tale" to="/blog/data-exploration-with-pandas-profiler-and-d-tale" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">a previous post</a>.</p><p id="body__33a806296238" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">However, the data we require may not always be given to us in a format that is suitable for immediate manipulation. It may be the case that the data can be obtained from an Application Programming Interface (API). Or we may connect directly to a database to obtain the information we require.</p><p id="body__d188bfadea54" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Another rich source of data is the web and you may have obtained some useful data points from it already. Simply visit your favourite Wikipedia page and you may discover how many gold medals each country has won in the recent Olympic Games in Tokyo. Webpages are also rich in textual content and although you may copy and paste this information, or even type it into your text editor of choice, web scraping may be a method to consider. 
In another previous post we talked about <a href="https://domino.ai/blog/natural-language-in-python-using-spacy" to="/blog/natural-language-in-python-using-spacy" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">natural language processing</a> and extracted text from some webpages. In this post we are going to use a Python module called <a href="https://www.crummy.com/software/BeautifulSoup/" to="https://www.crummy.com/software/BeautifulSoup/" target="_blank" rel="noopener noreferrer" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">Beautiful Soup</a> to facilitate the process of data acquisition.</p><h2 id="body__118c4cdccdca" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">Web scraping</h2><p id="body__51b2c6b1aa7d" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can create a program that enables us to grab the pages we are interested in and obtain the information we are after. This is known as web scraping and the code we write requires us to obtain the source code of the web pages that contain the information. In other words, we need to parse the HTML that makes up the page to extract the data. 
In a nutshell, we need to complete the following steps:</p><ol id="body__e10761ef1670-parent" class="list-decimal list-inside my-4"><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span>Identify the webpage with the information we need</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span>Download the source code</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span>Identify the elements of the page that hold the information we need</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span>Extract and clean the information</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span>Format and save the data for further analysis</span></li></ol><p id="body__f98fb7ddba49" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Please note that not all pages let you scrape their content and others do not offer a clear cut view on this. We recommend that you check the terms and conditions for the pages you are after and adhere to them. 
It may be the case that there is an API you can use to get the data, and there are often additional benefits to using it instead of scraping directly.</p><h2 id="body__31bc966b61ce" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">HTML Primer</h2><p id="body__dab11ebb9cb8" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">As mentioned above, we will need to understand the structure of an HTML file to find our way around it. The way a webpage renders its content is described via HTML (or HyperText Markup Language), which provides detailed instructions indicating the format, style, and structure for the pages so that a browser can render things correctly.</p><p id="body__cabbbaace5e4" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">HTML uses tags to flag key structural elements. A tag is denoted by using the <code>&lt;</code> and <code>&gt;</code> symbols. We are also required to indicate where the tagged elements start and finish. For a tag called <code>mytag</code>, we denote the beginning of the tagged content as <code>&lt;mytag&gt;</code> and its end with <code>&lt;/mytag&gt;</code>.</p><p id="body__aa13dc72d4ba" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">The most basic HTML tag is the <code>&lt;html&gt;</code> tag, and it tells the browser that everything between the tags is HTML. 
The simplest HTML document is therefore defined as:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">&lt;html&gt;&lt;/html&gt;</code></pre></div></div></div></div><p id="body__ee38fbd12bef" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">The document above is empty. Let us look at a more useful example:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">&lt;html&gt;
  &lt;head&gt;
    &lt;title&gt;My HTML page&lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;p&gt;
      This is one paragraph.
    &lt;/p&gt;
    &lt;p&gt;
      This is another paragraph. &lt;b&gt;HTML&lt;/b&gt; is cool!
    &lt;/p&gt;
    &lt;div&gt;
      &lt;a href="https://blog.dominodatalab.com/" id="dominodatalab"&gt;Domino Datalab Blog&lt;/a&gt;
    &lt;/div&gt;
  &lt;/body&gt;
&lt;/html&gt;</code></pre></div></div></div></div><p id="body__e4b0277b35ad" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can see the HTML tag we had before. This time we have other tags inside it. 
We call tags inside another tag "children", and as you can imagine, tags can have "parents". In the document above <code>&lt;head&gt;</code> and <code>&lt;body&gt;</code> are children of <code>&lt;html&gt;</code> and in turn they are siblings. A nice family!</p><p id="body__75ba6bc80ae4" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">There are a few tags there:</p><ul class="list-none my-4"><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;head&gt;</code> contains metadata for the webpage, for example the page title</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;title&gt;</code> is the title of the page</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;body&gt;</code> defines the body of the page</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;p&gt;</code> is a paragraph</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;div&gt;</code> is a division or area of the page</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] 
inline-block"></span></div><span><code>&lt;b&gt;</code> indicates bold font-weight</span></li><li class="pl-8 mt-5 flex pt-list-item text-normal !leading-[1.8] lg:text-normal-lg"><div class="mr-5 shrink-0 grow-0 list-bullet"><span class="w-2.5 h-2.5 rounded-full mt-2 bg-[#777384] inline-block"></span></div><span><code>&lt;a&gt;</code> is a hyperlink, and in the example above it contains two attributes: <code>href</code>, which indicates the link's destination, and an identifier called <code>id</code>.</span></li></ul><p id="body__9a64e5bc24e8" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">OK, let us now try to parse this page.</p><h2 id="body__52594aa4ac4d" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">Beautiful Soup</h2><p id="body__5deebaff8c8f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">If we save the content of the HTML document described above and open it in a browser we will see something like this:</p><blockquote class="pl-8 mb-4 ml-4 mr-10 relative">This is one paragraph.<br/><br/><br/><br/>This is another paragraph. <strong>HTML</strong> is cool!<br/><br/><br/><br/><a href="https://domino.ai/blog" to="/blog" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">Domino Datalab Blog</a></blockquote><p id="body__5fac2f41246f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">However, we are interested in extracting this information for further use. We could manually copy and paste the data, but fortunately we don't need to - we have Beautiful Soup to help us.</p><p id="body__832ab5ca8220" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Beautiful Soup is a Python module that is able to make sense of the tags inside HTML and XML documents. 
You can take a look at the module's page <a href="https://beautiful-soup-4.readthedocs.io/en/latest/" to="https://beautiful-soup-4.readthedocs.io/en/latest/" target="_blank" rel="noopener noreferrer" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">here</a>.</p><p id="body__a540650c6eda" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us create a string with the content of our HTML. We will see how to read content from a live webpage later on.</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">my_html = """ <html> <head> <title>My HTML page</title> </head> <body> <p> This is one paragraph. </p> <p> This is another paragraph. <b>HTML</b> is cool! 
</p> <div> <a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a> </div> </body> </html>"""</code></pre></div></div></div></div><p id="body__a834e211eefa" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can now import Beautiful Soup and read the string as follows:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">from bs4 import BeautifulSoup
html_soup = BeautifulSoup(my_html, 'html.parser')</code></pre></div></div></div></div><p id="body__2ccccd48041a" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us look at the content of <code>html_soup</code>; as you can see, it looks boringly normal:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(html_soup)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 
px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"><html> <head> <title>My HTML page</title> </head> <body> <p> This is one paragraph. </p> <p> This is another paragraph. <b>HTML</b> is cool! </p> <div> <a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a> </div> </body> </html></code></pre></div></div></div></div><p id="body__7ed9cf0ed8ca" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">But, there is more to it than you may think. Look at the type of the html_soup variable, and as you can imagine, it is no longer a string. 
Instead, it is a BeautifulSoup object:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">type(html_soup)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">bs4.BeautifulSoup</code></pre></div></div></div></div><p id="body__455bd947c315" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">As we mentioned before, Beautiful Soup helps us make sense of the tags in our HTML file. It parses the document and locates the relevant tags. 
We can for instance directly ask for the title of the website:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(html_soup.title)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"><title>My HTML page</title></code></pre></div></div></div></div><p id="body__1bb476d3c6b6" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Or for the text inside the title tag:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div 
style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(html_soup.title.text)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">'My HTML page'</code></pre></div></div></div></div><p id="body__8df886ba1d17" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Similarly, we can look at the children of the body tag:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">list(html_soup.body.children)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 
focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">['\n', <p> This is one paragraph. </p>, '\n', <p> This is another paragraph. <b>HTML</b> is cool! </p>, '\n', <div> <a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a> </div>, '\n']</code></pre></div></div></div></div><p id="body__ed466d381754" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">From here, we can select the content of the first paragraph. From the list above we can see that it is the second element, because the first element is a newline string. Remember that Python counts from 0, so we are interested in element number 1:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(list(html_soup.body.children)[1])</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy 
code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"><p> This is one paragraph. </p></code></pre></div></div></div></div><p id="body__e80174a7d071" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">This works fine, but Beautiful Soup can help us even more. We can for instance find the first paragraph by referring to the <code>p</code> tag as follows:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(html_soup.find('p').text.strip())</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">'This is one paragraph.'</code></pre></div></div></div></div><p 
id="body__e07a3522a4ff" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can also look for all the paragraph instances:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">for paragraph in html_soup.find_all('p'): print(paragraph.text.strip())</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">This is one paragraph. This is another paragraph. HTML is cool!</code></pre></div></div></div></div><p id="body__387b721588c4" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us obtain the hyperlink referred to in our example HTML. 
We can do this by requesting all the <code>a</code> tags that contain an <code>href</code>:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">links = html_soup.find_all('a', href=True)
print(links)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[<a href="https://blog.dominodatalab.com/" id="dominodatalab">Domino Datalab Blog</a>]</code></pre></div></div></div></div><p id="body__04631eacb477" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">In this case the contents of the list <code>links</code> are tags themselves. 
Our list contains a single element and we can see its type:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(type(links[0]))</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">bs4.element.Tag</code></pre></div></div></div></div><p id="body__eed88fc5edce" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can therefore request the attributes <code>href</code> and <code>id</code> as follows:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div 
style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(links[0]['href'], links[0]['id'])</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">('https://blog.dominodatalab.com/', 'dominodatalab')</code></pre></div></div></div></div><h2 id="body__119a82333558" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">Reading the source code of a webpage</h2><p id="body__767e04b86a61" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We are now ready to start looking into requesting information from an actual webpage. We can do this with the help of the Requests module. 
Let us read the content of a previous blog post, for instance, the one on "<a href="https://domino.ai/blog/data-exploration-with-pandas-profiler-and-d-tale" to="/blog/data-exploration-with-pandas-profiler-and-d-tale" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">Data Exploration with Pandas Profiler and D-Tale</a>":</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">import requests

url = "https://blog.dominodatalab.com/data-exploration-with-pandas-profiler-and-d-tale"
my_page = requests.get(url)</code></pre></div></div></div></div><p id="body__b1f517681e6f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">A successful request returns the status code 200:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 
max-w-full">my_page.status_code</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">200</code></pre></div></div></div></div><p id="body__920bae75e93c" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">The content of the page can be seen with <code>my_page.content</code>. I will not show this as it will be a messy entry for this post, but you can go ahead and try it in your environment.<br/><br/>What we really want is to pass this information to Beautiful Soup so we can make sense of the tags in the document:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">blog_soup = BeautifulSoup(my_page.content, 'html.parser')</code></pre></div></div></div></div><p id="body__9c983e4ebec7" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us look at the heading tag <code>h1</code> that contains the 
heading of the page:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">blog_soup.h1</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"><h1 class="title"> <span class="hs_cos_wrapper hs_cos_wrapper_meta_field hs_cos_wrapper_type_text" data-hs-cos-general-type="meta_field" data-hs-cos-type="text" id="hs_cos_wrapper_name" style=""> Data Exploration with Pandas Profiler and D-Tale </span> </h1></code></pre></div></div></div></div><p id="body__b83531e400ce" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can see that it has a few attributes and what we really want is the text inside the tag:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 
focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">heading = blog_soup.h1.text
print(heading)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">'Data Exploration with Pandas Profiler and D-Tale'</code></pre></div></div></div></div><p id="body__e34210972b33" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">The author of the blog post is identified in the <code>div</code> of class <code>author-link</code>. Let's take a look:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python 
text-sm w-full min-w-0 max-w-full">blog_author = blog_soup.find_all('div', class_="author-link")
print(blog_author)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[<div class="author-link"> by: <a href="//blog.dominodatalab.com/author/jrogel">Dr J Rogel-Salazar </a></div>]</code></pre></div></div></div></div><p id="body__c0f21fdf9ac9" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Note that we need to use the keyword argument <code>class_</code> (with a trailing underscore) to avoid a clash with the Python reserved word <code>class</code>. 
As we can see from the result above, the <code>div</code> has a hyperlink and the name of the author is in the text of that tag:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">blog_author[0].find('a').text</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">'Dr J Rogel-Salazar '</code></pre></div></div></div></div><p id="body__f0ad24f606cb" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">As you can see, we need to get well acquainted with the content of the source code of our page. You can use the tools that your favourite browser gives you to inspect elements of a website.</p><p id="body__a7c4d26a8eb3" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let's say that we are now interested in getting the list of goals given in the blog post. 
The information is in a <code><ul></code> tag which is an unordered list, and each entry is in a <code><li></code> tag, which is a list item. The unordered list has no class or role (unlike other lists in the page):</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">blog_soup.find('ul', class_=None, role=None)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"><ul> <li>Detecting erroneous data.</li> <li>Determining how much missing data there is.</li> <li>Understanding the structure of the data.</li> <li>Identifying important variables in the data.</li> <li>Sense-checking the validity of the data.</li> </ul></code></pre></div></div></div></div><p id="body__d8b0da30e298" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">OK, we can now extract the entries for the HTML list and put them in a Python 
list:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">my_ul = blog_soup.find('ul', class_=None, role=None)
li_goals = my_ul.find_all('li')
goals = []
for li_goal in li_goals:
    # .string holds the text of a tag that has a single text child
    goals.append(li_goal.string)
print(goals)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">['Detecting erroneous data.', 'Determining how much missing data there is.', 'Understanding the structure of the data.', 'Identifying important variables in the data.', 'Sense-checking the validity of the data.']</code></pre></div></div></div></div><p id="body__8fbf34e8caf9" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">As mentioned before, we could be interested in getting the text of the blog post to carry out some natural language processing. 
We can do that in one go with the <code>get_text()</code> method.</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">blog_text = blog_soup.get_text()</code></pre></div></div></div></div><p id="body__2192a9dbab49" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can now use some of the techniques described in the earlier post on <a href="https://domino.ai/blog/natural-language-in-python-using-spacy" to="/blog/natural-language-in-python-using-spacy" class="group transition-colors inline text-accent hover:text-accent-hover active:text-dark-background items-center" tabindex="0">natural language with spaCy</a>. In this case we are showing each entry, its part-of-speech (POS), the explanation for POS and whether the entry is considered a stop word or not. 
For convenience we are only showing the first 10 entries.</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(blog_text)
for entry in doc[:10]:
    print(entry.text, entry.pos_, spacy.explain(entry.pos_), entry.is_stop)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full"> SPACE space False
Data PROPN proper noun False
Exploration PROPN proper noun False
with ADP adposition True
Pandas PROPN proper noun False
Profiler PROPN proper noun False
and CCONJ coordinating conjunction True
D PROPN proper noun False
- PUNCT punctuation False
Tale PROPN proper noun False</code></pre></div></div></div></div><h2 id="body__7b56b1ea690d" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">Reading table 
data</h2><p id="body__bcf635a8ea6e" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Finally, let us use what we have learned so far to get data that can be shown in the form of a table. We mentioned at the beginning of this post that we may want to see the number of gold medals obtained by different countries in the Olympic Games in Tokyo. We can read that information from the relevant entry in Wikipedia.</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">url = 'https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table'
wiki_page = requests.get(url)
medal_soup = BeautifulSoup(wiki_page.content, 'html.parser')</code></pre></div></div></div></div><p id="body__b2d85a6cd5f7" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Using the inspect element functionality of my browser I can see that the table where the data is located has a class. 
See the screenshot from my browser:</p><figure class="mb-5 max-w-full"><div class="relative"><img loading="lazy" decoding="async" alt="beautiful_soup_medals_table" class="rounded-3.75" style="--img-aspect-ratio:1.6681818181818182;--img-natural-width:1468px;max-width:100vwpx" src="https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=1920&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=360&fit=max&auto=format 360w, https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=414&fit=max&auto=format 414w, https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=828&fit=max&auto=format 828w, https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=1080&fit=max&auto=format 1080w, https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=1366&fit=max&auto=format 1366w, https://cdn.sanity.io/images/kuana2sp/production-main/7c57951c281cc41fe3f7ae371adf07cdc298eab6-1468x880.jpg?w=1536&fit=max&auto=format 1536w" sizes="100vw" width="1468" height="880" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/></div></figure><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full 
min-w-0 max-w-full">medal_table = medal_soup.find('table', class_='wikitable sortable plainrowheaders jquery-tablesorter')</code></pre></div></div></div></div><p id="body__2761a9620889" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">In this case, we need to iterate through each row (<code>tr</code>) and then assign each of its elements (<code>td</code>) to a variable and append it to a list. One exception is the heading row of the table, which has <code>th</code> elements.</p><p id="body__6e6699a4f585" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us find all the rows. We will single out the first one of those to extract the headers, and we'll store the medal information in a variable called <code>allRows</code>:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">tmp = medal_table.find_all('tr')
first = tmp[0]
# Keep everything between the header row and the last row
allRows = tmp[1:-1]</code></pre></div></div></div></div><p id="body__1d5ef6044099" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us look at the <code>first</code> row:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div 
style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(first)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">&lt;tr&gt;&lt;th scope="col"&gt;Rank&lt;/th&gt;&lt;th scope="col"&gt;Team&lt;/th&gt;&lt;th class="headerSort" scope="col" style="width:4em;background-color:gold"&gt;Gold&lt;/th&gt;&lt;th class="headerSort" scope="col" style="width:4em;background-color:silver"&gt;Silver&lt;/th&gt;&lt;th class="headerSort" scope="col" style="width:4em;background-color:#c96"&gt;Bronze&lt;/th&gt;&lt;th scope="col" style="width:4em"&gt;Total&lt;/th&gt;&lt;/tr&gt;</code></pre></div></div></div></div><p id="body__d22e7088e5cc" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">As you can see, we need to find all the <code>th</code> tags and get their text. We will also get rid of leading and trailing whitespace with the <code>strip()</code> method. 
We do all this within a list comprehension:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">headers = [header.get_text().strip() for header in first.find_all('th')]
print(headers)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">['Rank', 'Team', 'Gold', 'Silver', 'Bronze', 'Total']</code></pre></div></div></div></div><p id="body__af21e271b71f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Cool! 
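</p><p class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">The header extraction above can also be tried without a network connection. The following is a minimal sketch, assuming a made-up header row that we build by hand (<code>demo_soup</code>, <code>demo_row</code> and <code>demo_headers</code> are hypothetical names used only for illustration). It relies on the <code>strip=True</code> argument of <code>get_text()</code>, which trims the whitespace around the text and so can stand in for the separate call to <code>strip()</code>:</p><div class="relative group mb-5 w-full min-w-0"><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: an empty document we add tags to
demo_soup = BeautifulSoup("", "html.parser")

# Build a small header row by hand; the stray spaces and newline are deliberate
demo_row = demo_soup.new_tag("tr")
for label in [" Rank ", "Team", "Gold\n"]:
    cell = demo_soup.new_tag("th")
    cell.string = label
    demo_row.append(cell)

# strip=True trims the whitespace around the text of each header cell
demo_headers = [th.get_text(strip=True) for th in demo_row.find_all("th")]
print(demo_headers)  # ['Rank', 'Team', 'Gold']</code></pre></div></div></div></div><p class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">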
We now turn our attention to the medals:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full"> results = [[data.get_text() for data in row.find_all('td')] for row in allRows]</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full"> print(results[:10])</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" 
style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[['1', '39', '41', '33', '113'], ['2', '38', '32', '18', '88'], ['3', '27', '14', '17', '58'], ['4', '22', '21', '22', '65'], ['5', '20', '28', '23', '71'], ['6', '17', '7', '22', '46'], ['7', '10', '12', '14', '36'], ['8', '10', '12', '11', '33'], ['9', '10', '11', '16', '37'], ['10', '10', '10', '20', '40']]</code></pre></div></div></div></div><p id="body__0c44e70f7682" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Hang on... that looks great but it does not have the names of the countries. Let us see the first element of <code>allRows</code>:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">allRows[0]</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;th scope="row" style="background-color:#f8f9fa;text-align:left"&gt;&lt;img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/33px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/44px-Flag_of_the_United_States.svg.png 2x" width="22"/&gt; &lt;a href="/wiki/United_States_at_the_2020_Summer_Olympics" title="United States at the 2020 Summer Olympics"&gt;United States&lt;/a&gt; &lt;span style="font-size:90%;"&gt;(USA)&lt;/span&gt;&lt;/th&gt;&lt;td&gt;39&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;td&gt;113&lt;/td&gt;&lt;/tr&gt;</code></pre></div></div></div></div><p id="body__ae607675fc6c" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Aha! The name of the country is in a <code>th</code> tag, and actually we can extract it from the string inside the hyperlink:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">countries = [[link.find(text=True) for link in row.find_all('a')] for row in allRows]
countries[:10]</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 
focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[['United States'], ['China'], ['Japan'], ['Great Britain'], ['ROC'], ['Australia'], ['Netherlands'], ['France'], ['Germany'], ['Italy']]</code></pre></div></div></div></div><p id="body__8955c551f90b" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">You can see from the table on the website that some countries have the same number of gold, silver and bronze medals and thus are given the same rank. See for instance rank 36, given to both Greece and Uganda. This has some implications for our data scraping strategy. Let us look at the results for entries 34 to 44:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">results[33:44]</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy 
code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[['34', '2', '4', '6', '12'], ['35', '2', '2', '9', '13'], ['36', '2', '1', '1', '4'], ['2', '1', '1', '4'], ['38', '2', '1', '0', '3'], ['39', '2', '0', '2', '4'], ['2', '0', '2', '4'], ['41', '2', '0', '1', '3'], ['42', '2', '0', '0', '2'], ['2', '0', '0', '2'], ['44', '1', '6', '12', '19']]</code></pre></div></div></div></div><p id="body__2a9b2b605236" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Our rows have five entries, but those that have the same ranking actually have four entries. These entries have a <code>rowspan</code> attribute as shown in the screenshot for rank 36 below:</p><figure class="mb-5 max-w-full"><div class="relative"><img loading="lazy" decoding="async" alt="beautiful_soup_rowspan" class="rounded-3.75" style="--img-aspect-ratio:1.4817518248175183;--img-natural-width:1218px;max-width:100vwpx" src="https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=1920&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=360&fit=max&auto=format 360w, https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=414&fit=max&auto=format 414w, https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=828&fit=max&auto=format 828w, https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=1080&fit=max&auto=format 1080w, https://cdn.sanity.io/images/kuana2sp/production-main/012456ebcc7ae414b41b0a7d9bbbca1b7adb9557-1218x822.jpg?w=1242&fit=max&auto=format 1242w" 
sizes="100vw" width="1218" height="822" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/></div></figure><p id="body__7437ac280890" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us find the entries that have a <code>rowspan</code> attribute and count the number of countries that share the same rank. 
We will keep track of the entry number, the <code>td</code> number, the number of countries that share the same rank and the rank assigned:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">rowspan = []
for num, tr in enumerate(allRows):
    for td_num, data in enumerate(tr.find_all('td')):
        if data.has_attr("rowspan"):
            # (row index, cell index, number of rows spanned, rank text)
            rowspan.append((num, td_num, int(data["rowspan"]), data.get_text()))</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(rowspan)</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy 
code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[(35, 0, 2, '36'), (38, 0, 2, '39'), (41, 0, 2, '42'), (45, 0, 2, '46'), (49, 0, 2, '50'), (55, 0, 2, '56'), (58, 0, 4, '59'), (62, 0, 3, '63'), (71, 0, 2, '72'), (73, 0, 3, '74'), (76, 0, 6, '77'), (85, 0, 8, '86')]</code></pre></div></div></div></div><p id="body__bf6238857b2f" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can now fix our <code>results</code> by inserting the correct rank in the rows that have missing values:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">for i in rowspan:
    # tr value of rowspan is in the 1st place in results
    for j in range(1, i[2]):
        # Add value in the next tr
        results[i[0]+j].insert(i[1], i[3])</code></pre></div></div></div></div><p id="body__11fe7436abb9" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Let us check that this worked:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" 
title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">print(results[33:44])</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">[['34', '2', '4', '6', '12'], ['35', '2', '2', '9', '13'], ['36', '2', '1', '1', '4'], ['36', '2', '1', '1', '4'], ['38', '2', '1', '0', '3'], ['39', '2', '0', '2', '4'], ['39', '2', '0', '2', '4'], ['41', '2', '0', '1', '3'], ['42', '2', '0', '0', '2'], ['42', '2', '0', '0', '2'], ['44', '1', '6', '12', '19']]</code></pre></div></div></div></div><p id="body__aac23220717a" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We can now insert the names of the countries too:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" 
style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">for i, country in enumerate(countries):
    results[i].insert(1, country[0])</code></pre></div></div></div></div><p id="body__7adfc437d1f2" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">Finally, we can use our data to create a Pandas dataframe:</p><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full">import pandas as pd

df = pd.DataFrame(data = results, columns = headers)
# Remove stray newline characters before converting to numbers
df['Rank'] = df['Rank'].map(lambda x: x.replace('\n',''))
df['Total'] = df['Total'].map(lambda x: x.replace('\n',''))
cols = ['Rank','Gold', 'Silver', 'Bronze', 'Total']
df[cols] = df[cols].apply(pd.to_numeric)
df.head()</code></pre></div></div></div></div><table class="my-10 lg:my-15 text-left align-top"><thead><tr class="border-b-accent border-b"><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Rank</p></th><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Team</p></th><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Gold</p></th><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Silver</p></th><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Bronze</p></th><th class="py-2.5 pr-7.5 last:pr-0 text-accent"><p class="mb-4">Total</p></th></tr></thead><tbody><tr class="border-b-accent border-b"><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">1</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">United States</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">39</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">41</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">33</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">113</p></td></tr><tr class="border-b-accent border-b"><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">2</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">China</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">38</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">32</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">18</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">88</p></td></tr><tr class="border-b-accent border-b"><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">3</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">Japan</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">27</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">14</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">17</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">58</p></td></tr><tr class="border-b-accent border-b"><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">4</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">Great Britain</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">22</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">21</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">22</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">65</p></td></tr><tr class="border-b-accent border-b"><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">5</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">ROC</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">20</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">28</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">23</p></td><td class="pt-3.75 pb-7.5 pr-7.5 last:pr-0"><p class="mb-4">71</p></td></tr></tbody></table><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 
focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full"> df['Gold'].mean()</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">3.6559139784946235</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="python text-sm w-full min-w-0 max-w-full"> df['Total'].mean()</code></pre></div></div></div></div><div class="relative group mb-5 w-full min-w-0"><button class="absolute top-3 right-3 px-3 py-1.5 text-xs font-medium rounded-md transition-all duration-200 z-10 bg-gray-800 
text-white hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 opacity-0 group-hover:opacity-100" title="Copy code">Copy</button><div class="overflow-x-auto w-full min-w-0"><div style="min-width:0;max-width:100%;word-break:break-all;white-space:pre-wrap;overflow-wrap:break-word"><div class="code-block-wrapper" style="max-width:100%;overflow:auto"><pre><code class="text text-sm w-full min-w-0 max-w-full">11.612903225806452</code></pre></div></div></div></div><h2 id="body__eef4a3eaa287" class="text-h2-small lg:text-h2-small-lg balance-text !leading-[1.2] font-medium text-dark-background mb-5 mt-10 first:mt-0">Summary</h2><p id="body__37a326e000c8" class="text-normal lg:text-normal-lg mb-5 !leading-[1.8]">We have seen how to parse an HTML document and make sense of the tags within it with the help of Beautiful Soup. You can use what you have learned here to obtain data that may otherwise be available only on a webpage. Please be mindful of the rights attached to the material you are obtaining: read the terms and conditions of the pages you are interested in and, if in doubt, err on the side of caution. One last word: web scraping depends on the structure of the webpages you are parsing. If the pages change, it is quite likely that your code will fail. 
In that case, be ready to roll up your sleeves, re-inspect the HTML tags, and fix your code accordingly.</p></div><div class="h-[3px] w-full shrink-0 mt-16 mb-10" style="background:linear-gradient(to right, #ff9421 0%, #f8475e 34%, #f020b3 66%, #8636f8 100%)"></div><div class="flex flex-col gap-10"><div class="relative flex flex-col md:flex-row items-start gap-4 md:gap-7.5"><img loading="lazy" decoding="async" alt="Dr J Rogel-Salazar" class="w-27.5 h-27.5 shrink-0 grow-0 object-cover rounded-full border-2 border-black/10" style="--img-aspect-ratio:1;--img-natural-width:400px" src="https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=110&fit=max&auto=format" srcSet="https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=110&fit=max&auto=format 110w, https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=220&fit=max&auto=format 220w, https://cdn.sanity.io/images/kuana2sp/production-main/7414bb869436e197d780c0812f160c6032fcc42e-400x400.jpg?w=330&fit=max&auto=format 330w" sizes="(max-width: 110px) 100vw, 110px" width="400" height="400" data-loaded="false" data-above-fold="false" rel="" fetchpriority="low"/><div><a href="https://domino.ai/blog/author/jrogel" to="/blog/author/jrogel" class="overlay-link text-large font-medium" tabindex="0">Dr J Rogel-Salazar</a><p class="text-normal mt-4">Dr Jesus Rogel-Salazar is a Research Associate in the Photonics Group in the Department of Physics at Imperial College London. He obtained his PhD in quantum atom optics at Imperial College in the group of Professor Geoff New and in collaboration with the Bose-Einstein Condensation Group in Oxford with Professor Keith Burnett. 
After completion of his doctorate in 2003, he took a postdoc in the Centre for Cold Matter at Imperial and moved on to the Department of Mathematics in the Applied Analysis and Computation Group with Professor Jeff Cash.</p></div></div></div><div class="h-[3px] w-full shrink-0 mt-10 mb-0" style="background:linear-gradient(to right, #ff9421 0%, #f8475e 34%, #f020b3 66%, #8636f8 100%)"></div></div><div class="hidden lg:block basis-[30%] lg:self-stretch flex flex-col"><div class="mb-6 shrink-0"><article class="relative overflow-hidden rounded-2xl px-6 py-6 text-white"><img loading="eager" decoding="sync" alt="background gradient" class="absolute inset-0 w-full h-full object-cover pointer-events-none" aria-hidden="true" style="--img-aspect-ratio:1.2111553784860558;--img-natural-width:304px" src="https://cdn.sanity.io/images/kuana2sp/production-main/52d39f4db8b7eef92b7ac1bacedea50dc39196f3-304x251.png?w=1200&fit=max&auto=format" srcSet="" sizes="(max-width: 1200px) 100vw, 1200px" width="304" height="251" data-loaded="false" data-above-fold="true" rel="" fetchpriority="high"/><div class="relative z-10 flex flex-col gap-4"><span class="text-xs font-medium uppercase font-mono tracking-[0.15em]">Domino Platform</span><h3 class="text-xl md:text-2xl font-medium balance-text">The enterprise platform to build, deliver, and govern AI</h3><p class="text-base">Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.</p><div class="mt-2"><a href="https://domino.ai/demo" to="/demo" aria-label="Watch demo" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors text-white bg-[#563FB3] hover:bg-[#8636F8] active:bg-[#8636F8] border-2 border-[#ffffff] mainPurpleSolid inline-flex px-6 py-2.5 uppercase" tabindex="0">Watch demo</a></div></div></article></div><div class="sticky top-[130px] shrink-0 rounded-2.5 px-4 py-4 md:px-6 lg:px-5 lg:pt-5 lg:pb-4 bg-light-background"><p class="uppercase 
text-small font-mono text-dark-background mb-2 pl-4 border-b border-light-border-2 pb-2">In this article</p><ul class="relative"><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><div class="absolute left-0 top-1/2 w-1 h-6 -translate-y-1/2 rounded-full opacity-50" style="background:linear-gradient(to bottom, #ff9421 0%, #f8475e 34%, #f020b3 66%, #8636f8 100%);height:24px" aria-hidden="true"></div><a href="#body__118c4cdccdca" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">Web scraping</a></li><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><a href="#body__31bc966b61ce" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">HTML Primer</a></li><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><a href="#body__52594aa4ac4d" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">Beautiful Soup</a></li><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><a href="#body__119a82333558" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">Reading the source code of a webpage</a></li><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><a href="#body__7b56b1ea690d" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">Reading table data</a></li><li class="relative border-b last:border-b-0 border-light-border-2 pl-2"><a href="#body__eef4a3eaa287" class="text-small py-3 pl-2 block hover:text-accent-2 transition-colors">Summary</a></li></ul></div></div></div></div></section><div class="px-6 pt-0 pb-8 lg:hidden"><article class="relative overflow-hidden rounded-2xl px-6 py-6 text-white"><img loading="eager" decoding="sync" alt="background gradient" class="absolute inset-0 w-full h-full object-cover pointer-events-none" aria-hidden="true" style="--img-aspect-ratio:1.2111553784860558;--img-natural-width:304px" 
src="https://cdn.sanity.io/images/kuana2sp/production-main/52d39f4db8b7eef92b7ac1bacedea50dc39196f3-304x251.png?w=1200&fit=max&auto=format" srcSet="" sizes="(max-width: 1200px) 100vw, 1200px" width="304" height="251" data-loaded="false" data-above-fold="true" rel="" fetchpriority="high"/><div class="relative z-10 flex flex-col gap-4"><span class="text-xs font-medium uppercase font-mono tracking-[0.15em]">Domino Platform</span><h3 class="text-xl md:text-2xl font-medium balance-text">The enterprise platform to build, deliver, and govern AI</h3><p class="text-base">Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI Platform.</p><div class="mt-2"><a href="https://domino.ai/demo" to="/demo" aria-label="Watch demo" class="group inline-flex items-center justify-center text-sm h-auto py-2 rounded-full monospace transition-colors text-white bg-[#563FB3] hover:bg-[#8636F8] active:bg-[#8636F8] border-2 border-[#ffffff] mainPurpleSolid inline-flex px-6 py-2.5 uppercase" tabindex="0">Watch demo</a></div></div></article></div><section class="mx-auto py-10 lg:py-12.5 bg-white px-3.75 md:px-10 lg:px-25 pt-0 lg:pt-12.5"><div class="mx-auto max-w-[1300px]"><div class="contentWrapper"><div id="pf-holder"></div></div></div></section></main><nav class="footer-navigation w-full px-3.75 sm:px-10 lg:px-25 pt-12.5 pb-12.5 bg-black" aria-label="Footer navigation"><div class="max-w-screen-3xl mx-auto"><div class="flex flex-col-reverse xl:flex-row gap-2.5 py-6 lg:my-6"><div class="basis-full xl:basis-1/4 text-white"><p class="text-normal text-light-contrast-muted">© 2026 Domino Data Lab, Inc. Made in San Francisco. 
</p></div><ul class="basis-full xl:basis-3/4 flex flex-wrap flex-col lg:flex-row justify-between xl:justify-end list-none gap-x-14 text-sm pb-3 xl:pb-0"><li class="text-light-contrast-muted lg:text-light-background pb-3 lg:pb-0"><a href="javascript: Cookiebot.withdraw()">Do not sell my personal information</a></li><li class="text-light-background pb-3 lg:pb-0"><a href="https://domino.ai/legal/privacy-policy" to="/legal/privacy-policy" aria-label="Privacy policy" class="group transition-colors flex text-light hover:text-accent-hover active:text-accent-pressed items-center" tabindex="0">Privacy policy</a></li><li class="text-light-background pb-3 lg:pb-0"><a href="https://domino.ai/legal/terms" to="/legal/terms" aria-label="Terms and conditions" class="group transition-colors flex text-light hover:text-accent-hover active:text-accent-pressed items-center" tabindex="0">Terms and conditions</a></li><li class="text-light-background pb-3 lg:pb-0"><a href="https://domino.ai/security" to="/security" aria-label="Security" class="group transition-colors flex text-light hover:text-accent-hover active:text-accent-pressed items-center" tabindex="0">Security</a></li><li class="text-light-background pb-3 lg:pb-0"><a href="https://domino.ai/legal" to="/legal" aria-label="Legal" class="group transition-colors flex text-light hover:text-accent-hover active:text-accent-pressed items-center" tabindex="0">Legal</a></li></ul></div></div></nav></div></div><script>((STORAGE_KEY, restoreKey) => { if (!window.history.state || !window.history.state.key) { let key = Math.random().toString(32).slice(2); window.history.replaceState({ key }, ""); } try { let positions = JSON.parse(sessionStorage.getItem(STORAGE_KEY) || "{}"); let storedY = positions[restoreKey || window.history.state.key]; if (typeof storedY === "number") { window.scrollTo(0, storedY); } } catch (error) { console.error(error); sessionStorage.removeItem(STORAGE_KEY); } })("positions", null)</script><link rel="modulepreload" 
href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/entry.client-RCIPCEBA.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-MXZI65YE.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-G7U333EX.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-VHUHJVKA.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-NGIKGSHE.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-KO3OK2JV.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-ADMCF34Z.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-RWGK63UK.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-S6GMOHSS.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-4S46Q4HR.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-Z73KKJ76.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-27VMLXZS.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-37GP4PP4.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/root-M7S77CIS.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-ZR4ZGGKJ.js"/><link rel="modulepreload" href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/_shared/chunk-S27TBXW3.js"/><link rel="modulepreload" 
href="https://dominodatalab-git-main-domino-data-lab.vercel.app/build/routes/$-4GCROFKR.js"/><script>window.__remixContext = {"url":"/blog/getting-data-with-beautiful-soup","state":{"loaderData":{"root":{"ENV":{"PUBLIC_ALGOLIA_API_KEY":"a62f645193a865bae4e3563aabe9ff08","PUBLIC_ALGOLIA_APPLICATION_ID":"KMR81XDQA0","PUBLIC_ALGOLIA_INDEX":"global","PUBLIC_BASE_URL":"https://domino.ai","PUBLIC_SANITY_DATASET":"production-main","PUBLIC_SANITY_ID":"kuana2sp","PUBLIC_SITE_NAME":"Domino Data Lab","NODE_ENV":"production"}},"routes/$":{"footer":{"_id":"footer__i18n_en","bottomLinks":[{"_key":"acaf697d74b9","_type":"navLink","label":"Privacy policy","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"6d4df3a9-0aed-45fa-90f8-7da3d1bb554b","_type":"legalPage","locale":"en","routePath":"legal/privacy-policy"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"fa4ef9ab9fee","_type":"navLink","label":"Terms and conditions","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"8732b79a-235f-4670-910a-24c1fc3b1a01","_type":"legalPage","locale":"en","routePath":"legal/terms"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"e7536a188883","_type":"navLink","label":"Security","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"c059b108-1aec-456b-983f-5deffe5343a4","_type":"page","locale":"en","routePath":"security"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"08c023f36cee","_type":"navLink","label":"Legal","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"7c9e537e-1ad9-46b5-ad08-2fdf2a3bb66c","_type":"page","locale":"en","routePath":"legal"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}}],"companyContact":["548 Market Street PMB 72800","
San Francisco, CA 94104","(415) 570-2425"],"footerCta":{"_key":null,"backgroundColor":"magenta","label":"Watch Demo","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"da00752b-26c2-4ad6-84e3-19f437288984","_type":"page","locale":"en","routePath":"demo"},"linkType":"internal","newWindow":null,"url":null,"utmParameters":null,"videoPopup":null}},"footerDescription":"Domino Data Lab empowers the largest AI-driven enterprises to build and operate AI at scale. Domino’s Enterprise AI Platform provides an integrated experience encompassing model development, MLOps, collaboration, and governance.\n\nWith Domino, global enterprises can develop better medicines, grow more productive crops, develop more competitive products, and more. Founded in 2013, Domino is backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake, and other leading investors.","footerDescriptionTitle":"Who is Domino?","footerGroups":[{"_key":"98ec7fb17a2e","subGroups":[{"_key":"2349a07e38ea","childLinks":[{"_key":"cd8f06e8e203","_type":"navLink","label":"AI infrastructure","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"3dc3a092-c25a-434c-95cb-948bba046176","_type":"platform","locale":"en","routePath":"platform/on-demand-ai-infrastructure"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"4b4a5073b8b6","_type":"navLink","label":"Data management","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"6925280a-6f96-40bd-b03c-d3b3e1d694de","_type":"page","locale":"en","routePath":"platform/data-management"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"c79a7181ee9e","_type":"navLink","label":"AI 
workbench","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"3d6d3d27-9433-4e6c-bbad-714b45011a87","_type":"page","locale":"en","routePath":"platform/ai-workbench"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"87654e740ff8","_type":"navLink","label":"MLOps","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"41634af0-4ede-4971-9c7d-b0293d4d9e59","_type":"page","locale":"en","routePath":"platform/mlops"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"74e223880c1b","_type":"navLink","label":"AI governance","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"46ea895a-a0ad-4f99-bfeb-a059abc60ba4","_type":"page","locale":"en","routePath":"platform/ai-governance"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"7c32908d4b5b","_type":"navLink","label":"FinOps","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"c49158fa-1b21-4a3a-bf00-be9876df2fbf","_type":"page","locale":"en","routePath":"platform/finops"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"88ceed748b18","_type":"navLink","label":"Pricing","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"UEK1ButeAwaJAU7wJD2Q46","_type":"page","locale":"en","routePath":"pricing"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"f2839034d620","_type":"navLink","label":"Security & compliance","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"c059b108-1aec-456b-983f-5deffe5343a4","_type":"page","locale":"en","routePath":"security"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"d6885d71eed6","_type":"navLink","label":"What's 
new","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"cc026f58-ad82-4a56-8c00-c9be9fe5eede","_type":"page","locale":"en","routePath":"whats-new-in-domino"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}}],"showTitle":false,"title":"Platform"}],"title":"Platform"},{"_key":"91de48861ad4","subGroups":[{"_key":"da55061c8c0a","childLinks":[{"_key":"1ded8a020094","_type":"navLink","label":"Life sciences","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"635d21d1-b880-48de-9ab4-361a58d4e34c","_type":"solution","locale":"en","routePath":"solutions/life-sciences"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"5b2ce980ba2f","_type":"navLink","label":"Finance","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"35f47c2e-6e73-405d-9632-e092697525d3","_type":"solution","locale":"en","routePath":"solutions/banking-financial-services-insurance"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"09b47fcf3690","_type":"navLink","label":"Public 
sector","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"22779caf-82fb-4ddc-8483-040d4f50923b","_type":"solution","locale":"en","routePath":"solutions/public-sector"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"a8de03c48f07","_type":"navLink","label":"Retail","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"81388cad-3cf3-4f17-a4ca-d0b402ed0cc8","_type":"solution","locale":"en","routePath":"solutions/retail"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"3aba9adc4942","_type":"navLink","label":"Manufacturing","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"db1c3462-9178-4ee2-894f-daebffc0078f","_type":"solution","locale":"en","routePath":"solutions/manufacturing"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}}],"showTitle":true,"title":"Industries"},{"_key":"97aa3ecac2fd","childLinks":[{"_key":"96a2e4599ad5","_type":"navLink","label":"Generative AI","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"5ca53caf-7887-4157-976b-648775d18e85","_type":"page","locale":"en","routePath":"solutions/generative-ai"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"48c980c3595d","_type":"navLink","label":"Cost-effective data science","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"2e4bef64-5772-4f8e-9729-127a1b36a96f","_type":"page","locale":"en","routePath":"solutions/cost-effective-ai"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"5957bd10026c","_type":"navLink","label":"Self-service data 
science","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"9f1e172c-f8de-4541-834d-f610e26b2153","_type":"solution","locale":"en","routePath":"solutions/self-service-data-science"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"c09dc2d7f42c","_type":"navLink","label":"Model risk management","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"80cbe903-0a32-4eae-a1ec-6bd356385c4d","_type":"solution","locale":"en","routePath":"solutions/model-risk-management"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"e79c88a17e2c","_type":"navLink","label":"Cloud data science","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"0ebbc19c-9974-4e6c-9033-d0c161520463","_type":"solution","locale":"en","routePath":"solutions/cloud-data-science"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}}],"showTitle":true,"title":"Use 
Cases"}],"title":"Solutions"},{"_key":"539106abde30","subGroups":[{"_key":"b267abaa84d7","childLinks":[{"_key":"15a9ea7b90f0","_type":"navLink","label":"Events","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"eventsIndex__i18n_en","_type":"collectionIndex","locale":"en","routePath":"events"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"2f24f86f2dd9","_type":"navLink","label":"Blog","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"blogCollectionIndex__i18n_en","_type":"blogCollectionIndex","locale":"en","routePath":"blog"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"596c5ad6345d","_type":"navLink","label":"Podcast","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fcb83b27-1216-41d3-99e1-5ab84700726c","_type":"podcastShow","locale":"en","routePath":"data-science-leaders-podcast"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"ffa47932eb82","_type":"navLink","label":"Courses and certifications","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://university.domino.ai/","utmParameters":null,"videoPopup":null}},{"_key":"5718aaa19f03","_type":"navLink","label":"Data Science 
Dictionary","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"dictionaryIndex__i18n_en","_type":"dictionaryIndex","locale":"en","routePath":"data-science-dictionary"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"8be961dca962","_type":"navLink","label":"Documentation","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://docs.dominodatalab.com/","utmParameters":null,"videoPopup":null}},{"_key":"78ce9b83a68c","_type":"navLink","label":"Support","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://support.domino.ai/support/s/","utmParameters":null,"videoPopup":null}},{"_key":"c104a74b26be","_type":"navLink","label":"Demo hub","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fce13620-c43b-4421-87d7-1347bfaa807a","_type":"page","locale":"en","routePath":"demo-hub"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}}],"showTitle":null,"title":"Learn"}],"title":"Learn"},{"_key":"34639a4cd03c","subGroups":[{"_key":"c9a8a7ff3681","childLinks":[{"_key":"e83e2e16163c","_type":"navLink","label":"About","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"a7c8ef34-3069-448f-974c-bb86774d1066","_type":"page","locale":"en","routePath":"company"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}},{"_key":"9d86cdfb1a92","_type":"navLink","label":"Why 
2026","asset":{"_ref":"image-3dfeff1a8207fec23f9f2b96607f4248e9abab22-1535x939-png","_type":"reference"}},"preamble":"Rev is where leaders define how to scale impact. Join Domino for Rev 2026 — Philadelphia, New York, and London.","title":"Rev 2026"},"leftColumn":{"bgColor":{"hex":"#ffffff"},"bottomLinksSection":{"childLinks":[{"_key":"48d19622abdc","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Demo hub","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fce13620-c43b-4421-87d7-1347bfaa807a","_type":"page","locale":"en","routePath":"demo-hub"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":false,"showArrow":true},{"_key":"857c7bde2b60","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Request a demo","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"e0c9ee1b-879a-4c6d-acc9-987a1b71b11f","_type":"page","locale":"en","routePath":"request-a-demo"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":false,"showArrow":true}]},"topLink":{"tagLineLink":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}},"tagline":"Demo","title":"See Domino in action!","titleLink":{"_key":null,"_type":"navLink","label":"Watch 
demo","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"da00752b-26c2-4ad6-84e3-19f437288984","_type":"page","locale":"en","routePath":"demo"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}}}},{"_key":"0c84431213ce","_type":"navFlyoutAdvanced","anchor":"Company","centerColumn":[{"_key":"3d70dfa691ac","_type":"sections","section":{"bgColor":{"_type":"color","alpha":1,"hex":"#f7f6f8","hsl":{"_type":"hslaColor","a":1,"h":270,"l":0.9686274509803922,"s":0.12499999999999978},"hsv":{"_type":"hsvaColor","a":1,"h":270,"s":0.00806451612903223,"v":0.9725490196078431},"rgb":{"_type":"rgbaColor","a":1,"b":248,"g":246,"r":247}},"columns":[{"_key":"9fe35d3410c6","_type":"column","childLinks":[{"_key":"8ac2a69beeae","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Open positions","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"careersIndex__i18n_en","_type":"careersCollectionIndex","locale":"en","routePath":"careers"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":false},{"_key":"8dee09bd6f5d","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Campus recruiting","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"b278e262-afc9-43f6-ad23-b229d25b6cf1","_type":"landingPage","locale":"en","routePath":"careers/campus-recruiting"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"0c198e987233","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Careers 
blog","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"50c7dec9-f419-4c62-9e32-c165ddc22883","_type":"page","locale":"en","routePath":"careers/blog"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null}],"title":"Careers","titleLink":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}},{"_key":"c6c9944faf4b","_type":"column","childLinks":[{"_key":"51f8e160aa9c","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Security and compliance","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"c059b108-1aec-456b-983f-5deffe5343a4","_type":"page","locale":"en","routePath":"security"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"95eae5cf6d9f","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Legal center","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"7c9e537e-1ad9-46b5-ad08-2fdf2a3bb66c","_type":"page","locale":"en","routePath":"legal"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"2f6f2d73eb75","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Support","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://support.domino.ai/support/s","utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null}],"title":"Security and 
trust","titleLink":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}},{"_key":"58670085692e","_type":"column","childLinks":[{"_key":"b93e959191e2","_type":"navLinkWithIconText","description":null,"icon":null,"label":"About Domino","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"a7c8ef34-3069-448f-974c-bb86774d1066","_type":"page","locale":"en","routePath":"company"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"be5d141d100f","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Domino in the news","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"newsArticlesIndex__i18n_en","_type":"newsCollectionIndex","locale":"en","routePath":"news"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"c5eec27c4607","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Press releases","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"pressReleasesIndex__i18n_en","_type":"newsCollectionIndex","locale":"en","routePath":"news/press-releases"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null},{"_key":"9b89c3bd1994","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Contact 
us","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fadfaa6f-c0c3-4e96-89ba-bbcd4e39d328","_type":"page","locale":"en","routePath":"contactus"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":true,"showArrow":null}],"title":"Company","titleLink":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}}],"cta":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}}}],"disableFlyout":null,"featured":{"action":{"_key":null,"_type":"navLink","label":"Read more","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://www.reuters.com/world/middle-east/us-navy-turns-ai-firm-domino-options-counter-iranian-mines-2026-05-01/","utmParameters":null,"videoPopup":false}},"bgColor":{"_type":"color","alpha":1,"hex":"#f1f0f4","hsl":{"_type":"hslaColor","a":1,"h":254.9999999999999,"l":0.9490196078431372,"s":0.1538461538461546},"hsv":{"_type":"hsvaColor","a":1,"h":254.9999999999999,"s":0.016393442622950876,"v":0.9568627450980393},"rgb":{"_type":"rgbaColor","a":1,"b":244,"g":240,"r":241}},"image":{"_type":"image","alt":"Domino and the U.S. Navy","asset":{"_ref":"image-06592dd127d2b5f189c390c8aef9f45bf8fc979d-600x315-png","_type":"reference"}},"preamble":"Domino is an integral component of the Navy's modern MLOps pipelines for AI-based mine countermeasures.","title":"Domino x U.S. 
Navy"},"leftColumn":{"bgColor":{"hex":"#ffffff"},"bottomLinksSection":{"childLinks":[{"_key":"ca6538c4ec88","_type":"navLinkWithIconText","description":null,"icon":null,"label":"Contact us","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fadfaa6f-c0c3-4e96-89ba-bbcd4e39d328","_type":"page","locale":"en","routePath":"contactus"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false},"normalFontWeight":false,"showArrow":true}]},"topLink":{"tagLineLink":{"_key":null,"_type":"navLink","label":null,"link":{"deepLink":null,"eventName":null,"linkType":"none","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}},"tagline":"About us","title":"Enterprise AI built for innovation","titleLink":{"_key":null,"_type":"navLink","label":"Learn more","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"a7c8ef34-3069-448f-974c-bb86774d1066","_type":"page","locale":"en","routePath":"company"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}}}}},{"_key":"acf1e65a8b09","_type":"navLink","label":"Rev 2026","link":{"deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://rev.domino.ai?utm_campaign=rev_26&utm_content=navLink","utmParameters":null,"videoPopup":false}}],"primaryAction":{"_key":null,"backgroundColor":"magenta","label":"Watch Demo","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"da00752b-26c2-4ad6-84e3-19f437288984","_type":"page","locale":"en","routePath":"demo"},"linkType":"internal","newWindow":false,"url":"https://g.v","utmParameters":null,"videoPopup":null}},"secondaryAction":{"_key":null,"backgroundColor":"dark-grey","label":"Contact 
us","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"fadfaa6f-c0c3-4e96-89ba-bbcd4e39d328","_type":"page","locale":"en","routePath":"contactus"},"linkType":"internal","newWindow":null,"url":null,"utmParameters":null,"videoPopup":null}}},"isPreview":false,"isSanityData":true,"locale":"en","routeData":{"_createdAt":"2023-05-04T09:53:50Z","_id":"wN4plt0FLLjYOsFTYnzoEL","_rev":"RRw39DNmiEpzucqSW6jali","_type":"blogPost","_updatedAt":"2023-09-01T11:26:04Z","author":{"_id":"bQvS44cO2Kqh2eurYx38m4","_type":"blogPostAuthor","bio":"Dr Jesus Rogel-Salazar is a Research Associate in the Photonics Group in the Department of Physics at Imperial College London. He obtained his PhD in quantum atom optics at Imperial College in the group of Professor Geoff New and in collaboration with the Bose-Einstein Condensation Group in Oxford with Professor Keith Burnett. After completion of his doctorate in 2003, he took a postdoc in the Centre for Cold Matter at Imperial and moved on to the Department of Mathematics in the Applied Analysis and Computation Group with Professor Jeff Cash.","image":{"_type":"image","alt":"Dr J Rogel-Salazar","asset":{"_ref":"image-7414bb869436e197d780c0812f160c6032fcc42e-400x400-jpg","_type":"reference"}},"locale":"en","name":"Dr J Rogel-Salazar","routePath":"blog/author/jrogel"},"blocks":null,"blogAd":{"_id":"086d1ba4-eaaa-446c-b100-cd76cc499f27","backgroundImage":{"_type":"image","alt":"background gradient","asset":{"_ref":"image-52d39f4db8b7eef92b7ac1bacedea50dc39196f3-304x251-png","_type":"reference"}},"bgColorSelect":null,"cta":{"_key":null,"label":"Watch demo","link":{"deepLink":null,"eventName":null,"internalLink":{"_id":"da00752b-26c2-4ad6-84e3-19f437288984","_type":"page","locale":"en","routePath":"demo"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":false}},"eyebrow":"Domino Platform","subtitle":"Watch the 15 minute on-demand demo to get an overview of the Domino Enterprise AI 
Platform.","textWhite":true,"title":"The enterprise platform to build, deliver, and govern AI"},"body":[{"_key":"f3dfb12a9a5f","_type":"block","children":[{"_key":"f3dfb12a9a5f0","_type":"span","marks":[],"text":"Data is all around us, from the spreadsheets we analyse on a daily basis, to the weather forecast we rely on every morning or the webpages we read. In many cases, the data we consume is simply given to us, and a simple glance is enough to make a decision. For example, knowing that the chance of rain today is 75% all day makes me take my umbrella with me. In many other cases, the data provided is so rich that we need to roll up our sleeves and we may use some exploratory analysis to get our heads around it. We have talked about some useful packages to do this exploration in "},{"_key":"f3dfb12a9a5f1","_type":"span","marks":["2dd3364e7599"],"text":"a previous post"},{"_key":"f3dfb12a9a5f2","_type":"span","marks":[],"text":"."}],"markDefs":[{"_key":"2dd3364e7599","_type":"link","deepLink":null,"eventName":null,"internalLink":{"_id":"o1Y7BUrMcHK7lLTrzH8DzU","_type":"blogPost","locale":"en","routePath":"blog/data-exploration-with-pandas-profiler-and-d-tale"},"linkType":"internal","newWindow":null,"url":null,"utmParameters":null,"videoPopup":null}],"style":"normal"},{"_key":"33a806296238","_type":"block","children":[{"_key":"33a8062962380","_type":"span","marks":[],"text":"However, the data we require may not always be given to us in a format that is suitable for immediate manipulation. It may be the case that the data can be obtained from an Application Programming Interface (API). Or we may connect directly to a database to obtain the information we require."}],"markDefs":[],"style":"normal"},{"_key":"d188bfadea54","_type":"block","children":[{"_key":"d188bfadea540","_type":"span","marks":[],"text":"Another rich source of data is the web and you may have obtained some useful data points from it already. 
Simply visit your favourite Wikipedia page and you may discover how many gold medals each country has won in the recent Olympic Games in Tokyo. Webpages are also rich in textual content and although you may copy and paste this information, or even type it into your text editor of choice, web scraping may be a method to consider. In another previous post we talked about "},{"_key":"d188bfadea541","_type":"span","marks":["cfb3a387c822"],"text":"natural language processing"},{"_key":"d188bfadea542","_type":"span","marks":[],"text":" and extracted text from some webpages. In this post we are going to use a Python module called "},{"_key":"d188bfadea543","_type":"span","marks":["6f060839e4ec"],"text":"Beautiful Soup"},{"_key":"d188bfadea544","_type":"span","marks":[],"text":" to facilitate the process of data acquisition."}],"markDefs":[{"_key":"cfb3a387c822","_type":"link","deepLink":null,"eventName":null,"internalLink":{"_id":"bQvS44cO2Kqh2eurZ5pLts","_type":"blogPost","locale":"en","routePath":"blog/natural-language-in-python-using-spacy"},"linkType":"internal","newWindow":null,"url":null,"utmParameters":null,"videoPopup":null},{"_key":"6f060839e4ec","_type":"link","deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://www.crummy.com/software/BeautifulSoup/","utmParameters":null,"videoPopup":null}],"style":"normal"},{"_key":"118c4cdccdca","_type":"block","children":[{"_key":"118c4cdccdca0","_type":"span","marks":[],"text":"Web scraping"}],"markDefs":[],"style":"h2"},{"_key":"51b2c6b1aa7d","_type":"block","children":[{"_key":"51b2c6b1aa7d0","_type":"span","marks":[],"text":"We can create a program that enables us to grab the pages we are interested in and obtain the information we are after. This is known as web scraping and the code we write requires us to obtain the source code of the web pages that contain the information. In other words, we need to parse the HTML that makes up the page to extract the data. 
In a nutshell, we need to complete the following steps:"}],"markDefs":[],"style":"normal"},{"_key":"e10761ef1670","_type":"block","children":[{"_key":"e10761ef16700","_type":"span","marks":[],"text":"Identify the webpage with the information we need"}],"level":1,"listItem":"number","markDefs":[],"style":"normal"},{"_key":"fc640df1d141","_type":"block","children":[{"_key":"fc640df1d1410","_type":"span","marks":[],"text":"Download the source code"}],"level":1,"listItem":"number","markDefs":[],"style":"normal"},{"_key":"fec48109ab83","_type":"block","children":[{"_key":"fec48109ab830","_type":"span","marks":[],"text":"Identify the elements of the page that hold the information we need"}],"level":1,"listItem":"number","markDefs":[],"style":"normal"},{"_key":"58198bb06b5a","_type":"block","children":[{"_key":"58198bb06b5a0","_type":"span","marks":[],"text":"Extract and clean the information"}],"level":1,"listItem":"number","markDefs":[],"style":"normal"},{"_key":"8c18a5c51494","_type":"block","children":[{"_key":"8c18a5c514940","_type":"span","marks":[],"text":"Format and save the data for further analysis"}],"level":1,"listItem":"number","markDefs":[],"style":"normal"},{"_key":"f98fb7ddba49","_type":"block","children":[{"_key":"f98fb7ddba490","_type":"span","marks":[],"text":"Please note that not all pages let you scrape their content and others do not offer a clear cut view on this. We recommend that you check the terms and conditions for the pages you are after and adhere to them. 
It may be the case that there is an API you can use to get the data, and there are often additional benefits for using it instead of scraping directly."}],"markDefs":[],"style":"normal"},{"_key":"31bc966b61ce","_type":"block","children":[{"_key":"31bc966b61ce0","_type":"span","marks":[],"text":"HTML Primer"}],"markDefs":[],"style":"h2"},{"_key":"dab11ebb9cb8","_type":"block","children":[{"_key":"dab11ebb9cb80","_type":"span","marks":[],"text":"As mentioned above, we will need to understand the structure of an HTML file to find our way around it. The way a webpage renders its content is described via HTML (or HyperText Markup Language), which provides detailed instructions indicating the format, style, and structure for the pages so that a browser can render things correctly."}],"markDefs":[],"style":"normal"},{"_key":"cabbbaace5e4","_type":"block","children":[{"_key":"cabbbaace5e40","_type":"span","marks":[],"text":"HTML uses tags to flag key structure elements. A tag is denoted by using the "},{"_key":"cabbbaace5e41","_type":"span","marks":["code"],"text":"<"},{"_key":"cabbbaace5e42","_type":"span","marks":[],"text":" and "},{"_key":"cabbbaace5e43","_type":"span","marks":["code"],"text":">"},{"_key":"cabbbaace5e44","_type":"span","marks":[],"text":" symbols. We are also required to indicate where the tagged elements start and finish. 
For a tag called "},{"_key":"cabbbaace5e45","_type":"span","marks":["code"],"text":"mytag"},{"_key":"cabbbaace5e46","_type":"span","marks":[],"text":", we denote the beginning of the tagged content as "},{"_key":"cabbbaace5e47","_type":"span","marks":["code"],"text":"<mytag>"},{"_key":"cabbbaace5e48","_type":"span","marks":[],"text":" and its end with "},{"_key":"cabbbaace5e49","_type":"span","marks":["code"],"text":"</mytag>"},{"_key":"cabbbaace5e410","_type":"span","marks":[],"text":"."}],"markDefs":[],"style":"normal"},{"_key":"aa13dc72d4ba","_type":"block","children":[{"_key":"aa13dc72d4ba0","_type":"span","marks":[],"text":"The most basic HTML tag is the "},{"_key":"aa13dc72d4ba1","_type":"span","marks":["code"],"text":"<html>"},{"_key":"aa13dc72d4ba2","_type":"span","marks":[],"text":" tag, and it tells the browser that everything between the tags is HTML. The simplest HTML document is therefore defined as:"}],"markDefs":[],"style":"normal"},{"_key":"a27702d386e5","_type":"block.code","content":{"code":"<html></html>","language":"text"}},{"_key":"ee38fbd12bef","_type":"block","children":[{"_key":"ee38fbd12bef0","_type":"span","marks":[],"text":"The document above is empty. Let us look at a more useful example:"}],"markDefs":[],"style":"normal"},{"_key":"af6b3748139b","_type":"block.code","content":{"code":"<html>\n<head>\n <title>My HTML page\n\n\n

\n This is one paragraph.\n

\n

\n This is another paragraph. HTML is cool!\n

\n
\n Domino Datalab Blog\n
\n\n","language":"text"}},{"_key":"e4b0277b35ad","_type":"block","children":[{"_key":"e4b0277b35ad0","_type":"span","marks":[],"text":"We can see the HTML tag we had before. This time we have other tags inside it. We call tags inside another tag \"children,\" and as you can imagine, tags can have \"parents\". In the document above "},{"_key":"e4b0277b35ad1","_type":"span","marks":["code"],"text":"<head>"},{"_key":"e4b0277b35ad2","_type":"span","marks":[],"text":" and "},{"_key":"e4b0277b35ad3","_type":"span","marks":["code"],"text":"<body>"},{"_key":"e4b0277b35ad4","_type":"span","marks":[],"text":" are children of "},{"_key":"e4b0277b35ad5","_type":"span","marks":["code"],"text":"<html>"},{"_key":"e4b0277b35ad6","_type":"span","marks":[],"text":" and in turn they are siblings. A nice family!"}],"markDefs":[],"style":"normal"},{"_key":"75ba6bc80ae4","_type":"block","children":[{"_key":"75ba6bc80ae40","_type":"span","marks":[],"text":"There are a few tags there:"}],"markDefs":[],"style":"normal"},{"_key":"a8401d15742c","_type":"block","children":[{"_key":"a8401d15742c0","_type":"span","marks":["code"],"text":"<head>"},{"_key":"a8401d15742c1","_type":"span","marks":[],"text":" contains metadata for the webpage, for example the page title"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"927e6655e73e","_type":"block","children":[{"_key":"927e6655e73e0","_type":"span","marks":["code"],"text":"<title>"},{"_key":"927e6655e73e1","_type":"span","marks":[],"text":" is the title of the page"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"ec9ad982e275","_type":"block","children":[{"_key":"ec9ad982e2750","_type":"span","marks":["code"],"text":"<body>"},{"_key":"ec9ad982e2751","_type":"span","marks":[],"text":" defines the body of the 
page"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"368f1c6e9f78","_type":"block","children":[{"_key":"368f1c6e9f780","_type":"span","marks":["code"],"text":"<p>"},{"_key":"368f1c6e9f781","_type":"span","marks":[],"text":" is a paragraph"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"9b9093384cdd","_type":"block","children":[{"_key":"9b9093384cdd0","_type":"span","marks":["code"],"text":"<div>"},{"_key":"9b9093384cdd1","_type":"span","marks":[],"text":" is a division or area of the page"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"86d3a758bc65","_type":"block","children":[{"_key":"86d3a758bc650","_type":"span","marks":["code"],"text":"<b>"},{"_key":"86d3a758bc651","_type":"span","marks":[],"text":" indicates bold font-weight"}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"cbc286ebd0b9","_type":"block","children":[{"_key":"cbc286ebd0b90","_type":"span","marks":["code"],"text":"<a>"},{"_key":"cbc286ebd0b91","_type":"span","marks":[],"text":" is a hyperlink, and in the example above it contains two attributes "},{"_key":"cbc286ebd0b92","_type":"span","marks":["code"],"text":"href"},{"_key":"cbc286ebd0b93","_type":"span","marks":[],"text":" which indicates the link's destination and an identifier called "},{"_key":"cbc286ebd0b94","_type":"span","marks":["code"],"text":"id"},{"_key":"cbc286ebd0b95","_type":"span","marks":[],"text":"."}],"level":1,"listItem":"bullet","markDefs":[],"style":"normal"},{"_key":"9a64e5bc24e8","_type":"block","children":[{"_key":"9a64e5bc24e80","_type":"span","marks":[],"text":"OK, let us now try to parse this page."}],"markDefs":[],"style":"normal"},{"_key":"52594aa4ac4d","_type":"block","children":[{"_key":"52594aa4ac4d0","_type":"span","marks":[],"text":"Beautiful Soup"}],"markDefs":[],"style":"h2"},{"_key":"5deebaff8c8f","_type":"block","children":[{"_key":"5deebaff8c8f0","_type":"span","marks":[],"text":"If we save the content of 
the HTML document described above and open it in a browser, we will see something like this:"}],"markDefs":[],"style":"normal"},{"_key":"2e2ba7b8c097","_type":"block","children":[{"_key":"2e2ba7b8c0970","_type":"span","marks":[],"text":"This is one paragraph.\n\n\n\nThis is another paragraph. "},{"_key":"2e2ba7b8c0971","_type":"span","marks":["strong"],"text":"HTML"},{"_key":"2e2ba7b8c0972","_type":"span","marks":[],"text":" is cool!\n\n\n\n"},{"_key":"2e2ba7b8c0973","_type":"span","marks":["463b94f6e88f"],"text":"Domino Datalab Blog"}],"markDefs":[{"_key":"463b94f6e88f","_type":"link","deepLink":null,"eventName":null,"internalLink":{"_id":"blogCollectionIndex__i18n_en","_type":"blogCollectionIndex","locale":"en","routePath":"blog"},"linkType":"internal","newWindow":null,"url":null,"utmParameters":null,"videoPopup":null}],"style":"blockquote"},{"_key":"5fac2f41246f","_type":"block","children":[{"_key":"5fac2f41246f0","_type":"span","marks":[],"text":"However, we are interested in extracting this information for further use. We could manually copy and paste the data, but fortunately we don't need to - we have Beautiful Soup to help us."}],"markDefs":[],"style":"normal"},{"_key":"832ab5ca8220","_type":"block","children":[{"_key":"832ab5ca82200","_type":"span","marks":[],"text":"Beautiful Soup is a Python module that is able to make sense of the tags inside HTML and XML documents. You can take a look at the module's page "},{"_key":"832ab5ca82201","_type":"span","marks":["be47451fba5d"],"text":"here"},{"_key":"832ab5ca82202","_type":"span","marks":[],"text":"."}],"markDefs":[{"_key":"be47451fba5d","_type":"link","deepLink":null,"eventName":null,"linkType":"external","newWindow":true,"url":"https://beautiful-soup-4.readthedocs.io/en/latest/","utmParameters":null,"videoPopup":null}],"style":"normal"},{"_key":"a540650c6eda","_type":"block","children":[{"_key":"a540650c6eda0","_type":"span","marks":[],"text":"Let us create a string with the content of our HTML. 
We will see how to read content from a live webpage later on."}],"markDefs":[],"style":"normal"},{"_key":"10c25942dc03","_type":"block.code","content":{"code":"my_html = \"\"\"\n<html>\n<head>\n<title>My HTML page\n\n\n

\nThis is one paragraph.\n

\n

\nThis is another paragraph. HTML is cool!\n

\n
\nDomino Datalab Blog\n
\n\n\"\"\"","language":"python"}},{"_key":"a834e211eefa","_type":"block","children":[{"_key":"a834e211eefa0","_type":"span","marks":[],"text":"We can now import Beautiful Soup and read the string as follows:"}],"markDefs":[],"style":"normal"},{"_key":"9eb8bf7a500e","_type":"block.code","content":{"code":"from bs4 import BeautifulSoup\nhtml_soup = BeautifulSoup(my_html, 'html.parser')","language":"python"}},{"_key":"2ccccd48041a","_type":"block","children":[{"_key":"2ccccd48041a0","_type":"span","marks":[],"text":"Let us look into the content of "},{"_key":"2ccccd48041a1","_type":"span","marks":["code"],"text":"html_soup"},{"_key":"2ccccd48041a2","_type":"span","marks":[],"text":", and as you can see it looks boringly normal:"}],"markDefs":[],"style":"normal"},{"_key":"6dc53111c491","_type":"block.code","content":{"code":"print(html_soup)","language":"python"}},{"_key":"d6970c8d513f","_type":"block.code","content":{"code":"\n\nMy HTML page\n\n\n

\n This is one paragraph.\n

\n

\n This is another paragraph. HTML is cool!\n

\n
\n Domino Datalab Blog\n
\n\n","language":"text"}},{"_key":"7ed9cf0ed8ca","_type":"block","children":[{"_key":"7ed9cf0ed8ca0","_type":"span","marks":[],"text":"But there is more to it than you may think. Look at the type of the html_soup variable, and as you can imagine, it is no longer a string. Instead, it is a BeautifulSoup object:"}],"markDefs":[],"style":"normal"},{"_key":"ff5d9558006c","_type":"block.code","content":{"code":"type(html_soup)","language":"python"}},{"_key":"0b1eb5f042b0","_type":"block.code","content":{"code":"bs4.BeautifulSoup","language":"text"}},{"_key":"455bd947c315","_type":"block","children":[{"_key":"455bd947c3150","_type":"span","marks":[],"text":"As we mentioned before, Beautiful Soup helps us make sense of the tags in our HTML file. It parses the document and locates the relevant tags. We can for instance directly ask for the title of the website:"}],"markDefs":[],"style":"normal"},{"_key":"9294b0dcb79f","_type":"block.code","content":{"code":"print(html_soup.title)","language":"python"}},{"_key":"41ea9e845011","_type":"block.code","content":{"code":"<title>My HTML page</title>","language":"text"}},{"_key":"1bb476d3c6b6","_type":"block","children":[{"_key":"1bb476d3c6b60","_type":"span","marks":[],"text":"Or for the text inside the title tag:"}],"markDefs":[],"style":"normal"},{"_key":"f727c405d076","_type":"block.code","content":{"code":"print(html_soup.title.text)","language":"python"}},{"_key":"492e27c5404b","_type":"block.code","content":{"code":"My HTML page","language":"text"}},{"_key":"8df886ba1d17","_type":"block","children":[{"_key":"8df886ba1d170","_type":"span","marks":[],"text":"Similarly, we can look at the children of the body tag:"}],"markDefs":[],"style":"normal"},{"_key":"2ea9c47974de","_type":"block.code","content":{"code":"list(html_soup.body.children)","language":"python"}},{"_key":"f16144bf5d6b","_type":"block.code","content":{"code":"['\\n',\n

\nThis is one paragraph.\n

,\n'\\n',\n

\nThis is another paragraph. HTML is cool!\n

,\n'\\n',\n
\nDomino Datalab Blog\n
,\n'\\n']","language":"text"}},{"_key":"ed466d381754","_type":"block","children":[{"_key":"ed466d3817540","_type":"span","marks":[],"text":"From here, we can select the first paragraph. The list above shows that it is the second element, and since Python indexes from 0, we want the element at index 1:"}],"markDefs":[],"style":"normal"},{"_key":"630e069e551b","_type":"block.code","content":{"code":"print(list(html_soup.body.children)[1])","language":"python"}},{"_key":"e410f968f29c","_type":"block.code","content":{"code":"

\nThis is one paragraph.\n

","language":"text"}},{"_key":"e80174a7d071","_type":"block","children":[{"_key":"e80174a7d0710","_type":"span","marks":[],"text":"This works fine, but Beautiful Soup can help us even more. For instance, we can find the first paragraph by referring to the "},{"_key":"e80174a7d0711","_type":"span","marks":["code"],"text":"p"},{"_key":"e80174a7d0712","_type":"span","marks":[],"text":" tag as follows:"}],"markDefs":[],"style":"normal"},{"_key":"f4046f252c40","_type":"block.code","content":{"code":"print(html_soup.find('p').text.strip())","language":"python"}},{"_key":"5b9f4ba94269","_type":"block.code","content":{"code":"This is one paragraph.","language":"text"}},{"_key":"e07a3522a4ff","_type":"block","children":[{"_key":"e07a3522a4ff0","_type":"span","marks":[],"text":"We can also look for all the paragraph instances:"}],"markDefs":[],"style":"normal"},{"_key":"36622e99fd1c","_type":"block.code","content":{"code":"for paragraph in html_soup.find_all('p'):\n    print(paragraph.text.strip())","language":"python"}},{"_key":"190bc676b871","_type":"block.code","content":{"code":"This is one paragraph.\nThis is another paragraph. HTML is cool!","language":"text"}},{"_key":"387b721588c4","_type":"block","children":[{"_key":"387b721588c40","_type":"span","marks":[],"text":"Let us obtain the hyperlink referred to in our example HTML. 
We can do this by requesting all the "},{"_key":"387b721588c41","_type":"span","marks":["code"],"text":"a"},{"_key":"387b721588c42","_type":"span","marks":[],"text":" tags that contain an "},{"_key":"387b721588c43","_type":"span","marks":["code"],"text":"href"},{"_key":"387b721588c44","_type":"span","marks":[],"text":":"}],"markDefs":[],"style":"normal"},{"_key":"619a5549b248","_type":"block.code","content":{"code":"# href=True keeps only the a tags that have an href attribute\nlinks = html_soup.find_all('a', href=True)\nprint(links)","language":"python"}},{"_key":"80931d7012ce","_type":"block.code","content":{"code":"[<a href=\"https://blog.dominodatalab.com/\" id=\"dominodatalab\">Domino Datalab Blog</a>]","language":"text"}},{"_key":"04631eacb477","_type":"block","children":[{"_key":"04631eacb4770","_type":"span","marks":[],"text":"In this case, the elements of the list links are themselves tags. Our list contains a single element, and we can check its type:"}],"markDefs":[],"style":"normal"},{"_key":"e1439dee819b","_type":"block.code","content":{"code":"print(type(links[0]))","language":"python"}},{"_key":"78378bbd37c2","_type":"block.code","content":{"code":"<class 'bs4.element.Tag'>","language":"text"}},{"_key":"eed88fc5edce","_type":"block","children":[{"_key":"eed88fc5edce0","_type":"span","marks":[],"text":"We can therefore request the attributes "},{"_key":"eed88fc5edce1","_type":"span","marks":["code"],"text":"href"},{"_key":"eed88fc5edce2","_type":"span","marks":[],"text":" and "},{"_key":"eed88fc5edce3","_type":"span","marks":["code"],"text":"id"},{"_key":"eed88fc5edce4","_type":"span","marks":[],"text":" as follows:"}],"markDefs":[],"style":"normal"},{"_key":"310ccdce500b","_type":"block.code","content":{"code":"print(links[0]['href'], links[0]['id'])","language":"python"}},{"_key":"9c3c543c2644","_type":"block.code","content":{"code":"https://blog.dominodatalab.com/ dominodatalab","language":"text"}},{"_key":"119a82333558","_type":"block","children":[{"_key":"119a823335580","_type":"span","marks":[],"text":"Reading the source code of a 
webpage"}],"markDefs":[],"style":"h2"},{"_key":"767e04b86a61","_type":"block","children":[{"_key":"767e04b86a610","_type":"span","marks":[],"text":"We are now ready to request information from an actual webpage. We can do this with the help of the Requests module. Let us read the content of a previous blog post, for instance, the one on \""},{"_key":"493c5cca707b","_type":"span","marks":["3a2436658caf"],"text":"Data Exploration with Pandas Profiler and D-Tale"},{"_key":"40836883278f","_type":"span","marks":[],"text":"\"."}],"markDefs":[{"_key":"3a2436658caf","_type":"link","deepLink":null,"eventName":null,"internalLink":{"_id":"o1Y7BUrMcHK7lLTrzH8DzU","_type":"blogPost","locale":"en","routePath":"blog/data-exploration-with-pandas-profiler-and-d-tale"},"linkType":"internal","newWindow":false,"url":null,"utmParameters":null,"videoPopup":null}],"style":"normal"},{"_key":"7f06a6246557","_type":"block.code","content":{"code":"import requests\nurl = \"https://blog.dominodatalab.com/data-exploration-with-pandas-profiler-and-d-tale\"\nmy_page = requests.get(url)","language":"python"}},{"_key":"b1f517681e6f","_type":"block","children":[{"_key":"b1f517681e6f0","_type":"span","marks":[],"text":"A successful request returns a response with status code 200:"}],"markDefs":[],"style":"normal"},{"_key":"66f4af1a3a7c","_type":"block.code","content":{"code":"my_page.status_code","language":"python"}},{"_key":"604abf642fdb","_type":"block.code","content":{"code":"200","language":"text"}},{"_key":"920bae75e93c","_type":"block","children":[{"_key":"920bae75e93c0","_type":"span","marks":[],"text":"The content of the page can be seen with "},{"_key":"920bae75e93c1","_type":"span","marks":["code"],"text":"my_page.content"},{"_key":"920bae75e93c2","_type":"span","marks":[],"text":". 
I will not show this as it will be a messy entry for this post, but you can go ahead and try it in your environment.\n\nWhat we really want is to pass this information to Beautiful Soup so we can make sense of the tags in the document:"}],"markDefs":[],"style":"normal"},{"_key":"d0a2bfdd0bc6","_type":"block.code","content":{"code":"blog_soup = BeautifulSoup(my_page.content, 'html.parser')","language":"python"}},{"_key":"9c983e4ebec7","_type":"block","children":[{"_key":"9c983e4ebec70","_type":"span","marks":[],"text":"Let us look at the heading tag "},{"_key":"9c983e4ebec71","_type":"span","marks":["code"],"text":"h1"},{"_key":"9c983e4ebec72","_type":"span","marks":[],"text":" that contains the heading of the page:"}],"markDefs":[],"style":"normal"},{"_key":"16d03f7e9176","_type":"block.code","content":{"code":"blog_soup.h1","language":"python"}},{"_key":"d7d699b2275a","_type":"block.code","content":{"code":"

\n \n Data Exploration with Pandas Profiler and D-Tale\n \n

","language":"text"}},{"_key":"b83531e400ce","_type":"block","children":[{"_key":"b83531e400ce0","_type":"span","marks":[],"text":"We can see that the tag carries a few attributes, but what we really want is the text inside it:"}],"markDefs":[],"style":"normal"},{"_key":"5788ec926832","_type":"block.code","content":{"code":"heading = blog_soup.h1.text.strip()  # drop the surrounding whitespace\nprint(heading)","language":"python"}},{"_key":"55c814b9588e","_type":"block.code","content":{"code":"Data Exploration with Pandas Profiler and D-Tale","language":"text"}},{"_key":"e34210972b33","_type":"block","children":[{"_key":"e34210972b330","_type":"span","marks":[],"text":"The author of the blog post is identified in the "},{"_key":"e34210972b331","_type":"span","marks":["code"],"text":"div"},{"_key":"e34210972b332","_type":"span","marks":[],"text":" of class "},{"_key":"e34210972b333","_type":"span","marks":["code"],"text":"author-link"},{"_key":"e34210972b334","_type":"span","marks":[],"text":". Let's take a look:"}],"markDefs":[],"style":"normal"},{"_key":"b905ec20e44f","_type":"block.code","content":{"code":"blog_author = blog_soup.find_all('div', class_=\"author-link\")\nprint(blog_author)","language":"python"}},{"_key":"2baca1370326","_type":"block.code","content":{"code":"[
by: Dr J Rogel-Salazar
]","language":"text"}},{"_key":"c0f21fdf9ac9","_type":"block","children":[{"_key":"c0f21fdf9ac90","_type":"span","marks":[],"text":"Note that we need to refer to "},{"_key":"c0f21fdf9ac91","_type":"span","marks":["code"],"text":"class_"},{"_key":"c0f21fdf9ac92","_type":"span","marks":[],"text":" (with an underscore) to avoid clashes with the Python reserved word "},{"_key":"c0f21fdf9ac93","_type":"span","marks":["code"],"text":"class"},{"_key":"c0f21fdf9ac94","_type":"span","marks":[],"text":". As we can see from the result above, the "},{"_key":"c0f21fdf9ac95","_type":"span","marks":["code"],"text":"div"},{"_key":"c0f21fdf9ac96","_type":"span","marks":[],"text":" has a hyperlink and the name of the author is in the text of that tag:"}],"markDefs":[],"style":"normal"},{"_key":"d2066ee171d8","_type":"block.code","content":{"code":"blog_author[0].find('a').text","language":"python"}},{"_key":"8d9b20baeb85","_type":"block.code","content":{"code":"'Dr J Rogel-Salazar '","language":"text"}},{"_key":"f0ad24f606cb","_type":"block","children":[{"_key":"f0ad24f606cb0","_type":"span","marks":[],"text":"As you can see, we need to get well acquainted with the content of the source code of our page. You can use the tools that your favourite browser gives you to inspect elements of a website."}],"markDefs":[],"style":"normal"},{"_key":"a7c4d26a8eb3","_type":"block","children":[{"_key":"a7c4d26a8eb30","_type":"span","marks":[],"text":"Let's say that we are now interested in getting the list of goals given in the blog post. The information is in a "},{"_key":"a7c4d26a8eb31","_type":"span","marks":["code"],"text":"